
A STUDY ON THE SPEECH ACOUSTIC-TO-ARTICULATORY MAPPING
USING MORPHOLOGICAL CONSTRAINTS

A dissertation
submitted to the Graduate School of Engineering
of Nagoya University
in partial fulfillment of the requirements
for the degree of
Doctor (Engineering)

By
Hani Camille Yehia
Abstract
The representation of speech based on articulatory parameters provides a fertile
paradigm for better modeling of the speech process. Such modeling is important,
for example, for the development of applications such as speech synthesis, coding
and even recognition, whose performance is directly related to the method used to
represent speech. However, an articulatory representation of speech is a goal whose
achievement still requires the solution of several problems. A key issue in this context is
the inversion of the articulatory-to-acoustic mapping in speech. This study focuses
on this point.
The articulatory-to-acoustic mapping in speech is defined here as the mapping
that, to every possible articulatory configuration of the vocal-tract, associates a
unique acoustic image. This image is defined as the set of acoustic properties inherent
in the vocal apparatus. The spaces formed by all possible articulatory configurations
and by their acoustic images are called articulatory space and acoustic space, respectively.
The articulatory space maps onto the acoustic space, i.e. every point in the
articulatory space has a unique image in the acoustic space, and every point in the
acoustic space has an inverse image in the articulatory space. However, since the
inverse image is not unique, the map is not a bijection (i.e. it is not a one-to-one
mapping).
The objective of this study is to estimate a restricted case of the acoustic-to-articulatory
mapping, using constraints imposed by the human morphology and by
the dynamics of the vocal-tract to determine the point in the articulatory space most
likely to map onto a given point contained in the acoustic space. In the restricted
case under analysis, only oral vowels are considered. The vocal-tract is represented
articulatorily by the log-area function (i.e. the logarithm of the cross-sectional area
along the vocal-tract) and acoustically by the set formed by the first three formant
frequencies (i.e. resonant frequencies of the vocal-tract). The area function was chosen
to represent the vocal-tract articulatorily because it is the articulatory characteristic
most directly related to the acoustic characteristics of the vocal-tract. In turn,
the set formed by the first three formant frequencies was chosen to represent the
vocal-tract acoustically because it is the acoustic characteristic of the vocal-tract
least influenced by factors other than the area function.
A major difficulty in appropriately estimating the area function from formant
frequencies is the one-to-many characteristic of the acoustic-to-articulatory mapping:
there are many area functions that map onto the same set of formant frequencies.
The main contribution of this study is the formulation of a framework to incorporate
constraints imposed by the human morphology into the mapping of the articulatory
space formed by all possible area functions onto the acoustic space formed by
all possible sets of formant frequencies. (The morphological information is extracted
from a corpus of midsagittal vocal-tract profiles obtained with cineradiography.) By
doing so, the articulatory space is limited to contain only area functions compatible
with the human morphology and, consequently, the ambiguity observed in the
acoustic-to-articulatory mapping is reduced. Subsequently, positional and continuity
constraints are added to complete the model. Tests carried out confirmed that, under
the constraints of the model proposed, plausible sequences of area functions can be
estimated from sequences of formant frequencies.
The procedure followed to formulate and evaluate the proposed method is described
step by step in the chapters of the study. Chapter 1 is the introduction.
Chapter 2 describes the corpus and the physical model used to represent the relation
between the area function and formant frequencies. Chapter 3 explains the parametric
models used to represent the area function. Chapter 4 shows the procedure used to
estimate the area function from formant frequencies under morphological, continuity
and positional constraints. Chapter 5 presents and interprets the results obtained
in the tests carried out with the model. Finally, Chapter 6 concludes the study by
summarizing the main points of the previous chapters, pointing out the strengths and
weaknesses of the method, and indicating directions for future research efforts.
A brief description of each chapter is given in the following
paragraphs.

Chapter 1 gives an overview of the context of the problem analyzed,
presents the problem and its importance, summarizes the history and the
alternative methods available in the literature, outlines the method
proposed to solve the problem, and describes the general organization of the
remaining chapters.
The first part of Chapter 2 describes the procedure used to obtain a corpus of 519
area functions from midsagittal tracings extracted from cineradiographic data and from
labiograms synchronously acquired at a rate of 50 frames per second. The process
to obtain the area functions from the raw data is as follows: first, each midsagittal
profile is plotted on a semi-polar grid which follows approximately the orientation
of the tube defined by the vocal-tract walls. The grid lines divide the vocal-tract
into sections. After that, each section is represented by its mean length and by its
mean midsagittal distance, i.e. the mean distance between anterior and posterior
walls of the tract. In the next step, the natural logarithm of the mean midsagittal
distance of each section is converted into the natural logarithm of the cross-sectional
area by means of a linear transformation. (This transformation is, however, only an
approximation because, although strongly associated with the midsagittal distance, the
cross-sectional area is not completely determined by it.) At this point, each
vocal-tract shape is represented by a set of log-areas (i.e. logarithms of cross-sectional
areas) and corresponding section lengths. The final step is to resample the log-areas
so that they become evenly spaced along the tract. By doing so, each vocal-tract
shape can be represented as a vector of log-areas plus the vocal-tract length. The
corpus is then arranged in a matrix whose rows contain the log-area vectors. This
matrix provides an efficient way to handle the corpus of area functions throughout
the study.
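The conversion and resampling steps just described can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the procedure actually used in the study: the function name `log_area_vector`, the coefficients `ln_alpha` and `beta` of the linear transformation, and all numeric values are hypothetical. The text states only that the log-distance is mapped linearly to the log-area and that the log-areas are then resampled at even spacing.

```python
import numpy as np

def log_area_vector(distances, lengths, ln_alpha, beta, n_sections=32):
    """Return (evenly spaced log-areas, total tract length).

    distances : mean midsagittal distance of each grid section
    lengths   : mean length of each grid section
    ln_alpha, beta : hypothetical coefficients of the linear map in the
                     log domain (scalars or one value per section)
    """
    distances = np.asarray(distances, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # Linear transformation in the log domain: ln A = ln(alpha) + beta * ln d.
    log_areas = ln_alpha + beta * np.log(distances)
    # Centre position of each (unevenly long) section along the tract.
    edges = np.concatenate(([0.0], np.cumsum(lengths)))
    centres = 0.5 * (edges[:-1] + edges[1:])
    total_length = edges[-1]
    # Centres of n_sections uniform sections spanning the same length.
    uniform = (np.arange(n_sections) + 0.5) * total_length / n_sections
    # np.interp holds the end values constant outside the first/last centre.
    return np.interp(uniform, centres, log_areas), total_length
```

Stacking the returned vectors (one per frame) as rows gives the kind of corpus matrix mentioned above; linear interpolation is one simple resampling choice among several.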
The second part of Chapter 2 explains the physical model used to calculate formant
frequencies from the area function. At first, the vocal-tract is modeled as a
rigid, lossless tube. After that, viscous losses, glottal opening, lip radiation load and
yielding walls are incorporated into the sound propagation model, and their effects
on the formant frequencies are analyzed. Finally, a numerical procedure to compute
formant frequencies from the area function, approximated by a concatenation of uniform
tubes, is described. This procedure is the tool used in this study to express
formant frequencies as a function of the area function (or, more specifically, of the
log-area function).
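As an illustration of how formants can be computed from a concatenation of uniform tubes, the sketch below covers only the rigid, lossless stage of the model. It assumes a closed glottis and an ideal open end at the lips (zero radiation impedance), multiplies the chain (transmission) matrices of the sections, and locates the frequencies at which the volume-velocity transfer function has poles. The function names, the values of `c` and `rho`, and the grid-plus-bisection root search are assumptions made for this example; the lossy corrections described in the text are omitted.

```python
import numpy as np

def chain_d(freq_hz, areas, section_len, c=350.0, rho=1.2):
    """D element of the lossless chain matrix from glottis to lips."""
    k = 2.0 * np.pi * freq_hz / c                 # wavenumber
    cos_kl, sin_kl = np.cos(k * section_len), np.sin(k * section_len)
    total = np.eye(2, dtype=complex)
    for area in areas:
        z = rho * c / area                        # characteristic impedance
        section = np.array([[cos_kl, 1j * z * sin_kl],
                            [1j * sin_kl / z, cos_kl]])
        total = total @ section
    # With zero pressure at the lips, U_glottis = D * U_lips, so the
    # formants are the zeros of D (which is real for lossless tubes).
    return total[1, 1].real

def formants(log_areas, tract_len, n_formants=3, f_max=5000.0, step=5.0):
    """First formants (Hz) of a tube given by evenly spaced log-areas."""
    areas = np.exp(np.asarray(log_areas, dtype=float))
    section_len = tract_len / len(areas)
    d = lambda f: chain_d(f, areas, section_len)
    found = []
    lo, d_lo = step, d(step)
    while lo < f_max and len(found) < n_formants:
        hi = lo + step
        d_hi = d(hi)
        if d_lo * d_hi < 0.0:                     # sign change brackets a zero
            a, b, d_a = lo, hi, d_lo
            for _ in range(60):                   # plain bisection
                mid = 0.5 * (a + b)
                d_mid = d(mid)
                if d_a * d_mid <= 0.0:
                    b = mid
                else:
                    a, d_a = mid, d_mid
            found.append(0.5 * (a + b))
        lo, d_lo = hi, d_hi
    return np.array(found)
```

For a uniform tube this reduces to the quarter-wavelength resonances c(2m-1)/(4L), which gives a convenient sanity check before a non-uniform area function is supplied.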
The objective of Chapter 3 is to obtain an efficient parametric representation of
the log-area function. The number of uniform sections necessary to obtain a good
approximation of the vocal-tract log-area function (32 sections in this study) is considerably
larger than the number of dimensions of the articulatory space. More efficient
representations can be achieved if each log-area function is expressed as the sum of a
few basis functions weighted by a set of coefficients. Since the basis functions (basis
vectors, in the discretized case) are fixed, the coefficients alone suffice to represent
a given log-area function (log-area vector, in the discretized case). Two
parametric representations are used in this work: Fourier analysis, and principal
component analysis (PCA).
In the first part of Chapter 3, a reasonably good approximation of the log-area
function is obtained when it is represented by the first eight terms of its Fourier
cosine series expansion. In this case, the set of basis functions is formed by cosine
functions. An important property of this representation is that there is a one-to-one
relationship between the first three odd coefficients of the Fourier cosine series
expansion of the log-area function and the first three formant frequencies determined
by it. This property is used in the method proposed here to determine the area
function most likely to have produced a given set of formant frequencies.
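A truncated cosine-series representation of a discretized log-area vector can be sketched as follows. The sampling at section midpoints, the normalization of the coefficients, and the function names are assumptions chosen for this example (they make the sampled cosines exactly orthogonal); the thesis may use a different convention.

```python
import numpy as np

def cosine_coefficients(log_area, n_terms=8):
    """First n_terms Fourier cosine coefficients of a log-area vector."""
    y = np.asarray(log_area, dtype=float)
    K = len(y)
    x = (np.arange(K) + 0.5) / K                  # section midpoints in [0, 1]
    basis = np.cos(np.pi * np.outer(np.arange(n_terms), x))   # (n_terms, K)
    coeffs = 2.0 / K * (basis @ y)
    coeffs[0] *= 0.5                              # 0th-order term is the mean
    return coeffs, basis

def reconstruct(coeffs, basis):
    """Log-area vector rebuilt from the truncated expansion."""
    return coeffs @ basis
```

With 32 sections and eight terms, each log-area vector is reduced from 32 numbers to 8; any component of order eight or higher is discarded by the truncation.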
Cosine functions form a set of "general purpose" basis functions which, in principle,
do not contain any information about the morphology of the vocal-tract. In the
second part of Chapter 3, a more efficient representation of the log-area vector (from
now on, the description refers to the discretized case), which makes use of
such information, is obtained by means of principal component analysis (PCA). In
this case, the set of basis vectors is formed by the eigenvectors associated with the five
largest eigenvalues of the covariance matrix of log-area vectors, which is estimated
using all vectors present in the corpus described in Chapter 2.
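The PCA step can be illustrated with a short NumPy sketch: estimate the covariance matrix of the log-area vectors, keep the eigenvectors of the largest eigenvalues, and encode each vector by its projection onto them. The function names are hypothetical, and any data passed in for experimentation would be placeholder data, not the corpus of the study.

```python
import numpy as np

def pca_basis(Y, n_components=5):
    """Mean, leading eigenvectors and eigenvalues of the rows of Y."""
    Y = np.asarray(Y, dtype=float)
    mean = Y.mean(axis=0)
    cov = np.cov(Y - mean, rowvar=False)          # covariance of the columns
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order], eigvals[order]

def encode(y, mean, U):
    """Principal component coefficients of one or more log-area vectors."""
    return (y - mean) @ U

def decode(b, mean, U):
    """Approximate log-area vectors rebuilt from their coefficients."""
    return mean + b @ U.T
```

Because the basis is estimated from the data, it carries the morphological information that a fixed cosine basis lacks, at the price of depending on the corpus used to estimate it.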
After the PCA procedure is carried out, some transformations are still necessary
to find a representation for the log-area vectors which exhibits a one-to-one mapping
between a subset of its components and the corresponding sets of formant frequencies
(as in the case of the Fourier representation). In this procedure, first, independent
component analysis (ICA) is used to find affine transformations which reduce the
dependence among the principal components used to represent the articulatory space,
and among the formant frequencies (represented in log-scale) used to represent the
acoustic space. Next, singular value decomposition (SVD) is used to find rotations of
both articulatory and acoustic spaces so that each component of the acoustic space
is subject to the major influence of one and only one component of the articulatory
space. In turn, each component of the articulatory space has major influence on
at most one component of the acoustic space.
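The role of the SVD can be seen in a small sketch. Assume, for illustration only, that the acoustic vector g is approximated as a linear transformation T of the articulatory vector b (the ICA step is omitted, and the names `decouple`, `g`, `b`, `h` and `xi` are chosen for this example). Rotating both spaces with the singular vectors of T turns the linear relation into a pseudo-diagonal one, so each rotated acoustic component depends on exactly one rotated articulatory component.

```python
import numpy as np

def decouple(T):
    """Rotations U (acoustic) and V (articulatory) with U.T @ T @ V pseudo-diagonal."""
    T = np.asarray(T, dtype=float)
    U, s, Vt = np.linalg.svd(T, full_matrices=True)
    S = np.zeros_like(T)                          # M x N pseudo-diagonal matrix
    S[:len(s), :len(s)] = np.diag(s)
    return U, S, Vt.T

# Usage with placeholder numbers: M = 3 acoustic, N = 5 articulatory components.
rng = np.random.default_rng(1)
T = rng.normal(size=(3, 5))
U, S, V = decouple(T)
b = rng.normal(size=5)
g = T @ b                                         # linearized acoustic vector
h, xi = U.T @ g, V.T @ b                          # rotated coordinates
# Now h[i] = S[i, i] * xi[i]; the last N - M components of xi do not affect h.
```

The columns of U and V are the normalized eigenvectors of T T^t and T^t T, respectively, which is the form in which these matrices appear in the List of Symbols.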
When either PCA or Fourier representation is used, it is observed that the mapping
of the articulatory space onto the acoustic space, albeit nonlinear, has a strongly linear
characteristic. The Fourier representation is useful to study analytical aspects of the
articulatory-to-acoustic mapping, whereas the PCA representation is optimal from a
statistical point of view.
The last part of Chapter 3 takes advantage of the fact that the vocal-tract moves
continuously in time, and so do the articulatory parameters. Based on this fact, it is
possible to parametrize the trajectories of articulatory components
and form the concept of an articulatory trajectory space. This procedure is successfully
carried out using either the Fourier or the PCA approach. In this temporal parametrization
the basis functions (of time) obtained with PCA have basically the shape of cosine
functions. In the tests performed, sequences of ten frames were well represented as linear
combinations of four basis functions.
At the end of Chapter 3, sequences of log-area vectors can be represented in
a very compact parametric form. As a quantitative example, the log-area vectors
contained in a sequence of 10 frames (200 ms of speech) can be represented by only
20 parameters.
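The parameter count can be made concrete with a small sketch. It assumes, as an interpretation that is consistent with the five principal components and four temporal basis functions mentioned above but is not stated explicitly in the text, that the 20 parameters are 5 articulatory components times 4 temporal coefficients. Cosine functions stand in here for the temporal basis obtained with PCA, and all names are hypothetical.

```python
import numpy as np

def cosine_time_basis(p=10, q=4):
    """p x q matrix whose orthonormal columns are sampled cosines of time."""
    t = (np.arange(p) + 0.5) / p                  # frame midpoints in [0, 1]
    V = np.cos(np.pi * np.outer(t, np.arange(q)))
    return V / np.linalg.norm(V, axis=0)

def encode_trajectory(B, V):
    """B is N x p (components x frames); returns N x q coefficients."""
    return B @ V

def decode_trajectory(C, V):
    """Rebuild the N x p trajectory from its N x q coefficients."""
    return C @ V.T
```

At 50 frames per second, ten frames span 200 ms, so a trajectory that would otherwise need 10 vectors of 5 components (50 numbers) is reduced to 20 under these assumptions.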
Chapter 4 focuses on the main objective of this work: the inversion of the
articulatory-to-acoustic mapping or, equivalently, the acoustic-to-articulatory mapping. This is,
as already stated, a one-to-many mapping, since the same set of formant frequencies
can be produced by an infinite number of area functions. In other words, a given point
in the acoustic space is associated with an entire subspace in the articulatory space.
The number of dimensions of this subspace is equal to the number of dimensions of
the articulatory space minus the number of dimensions of the acoustic space.
Among all the points contained in the articulatory space mapping onto a given
point in the acoustic space, it is possible to look for the point that can be reached with
minimum effort by the vocal-tract. This procedure is carried out by representing the
vocal-tract articulatory effort as a quadratic cost function and, subsequently, finding
the point of minimum cost among all points in the articulatory space that map onto
a given point in the acoustic space. The same procedure can be used in the case
of articulatory trajectories. In this case, the problem is to find the trajectory of
minimum cost among all points in the articulatory trajectory space that map onto a
given point in the acoustic trajectory space.
The mathematical formulation of this problem results in a non-linear system which
is numerically solved using a Newton-Raphson procedure. A stable solution was always
achieved, taking three iterations on average to converge. At the end of Chapter 4
the method used to estimate area functions from formant frequencies is complete.
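The idea of a minimum-cost inverse can be sketched with a toy problem. The code below minimizes a quadratic cost (x - x0)^t H (x - x0) subject to f(x) = target by repeatedly linearizing the acoustic constraint. It is a simplified stand-in for the Newton-Raphson solution mentioned above, not the formulation of Chapter 4. The mapping `f`, its Jacobian, the weight matrix `H`, the neutral position `x0` and all numbers are hypothetical, and the stopping test uses the largest absolute component of the update.

```python
import numpy as np

def min_cost_inverse(f, jac, target, x0, H, tol=1e-12, max_iter=50):
    """Minimize (x - x0)^t H (x - x0) subject to f(x) = target."""
    x0 = np.asarray(x0, dtype=float)
    H_inv = np.linalg.inv(H)
    x = x0.copy()
    for it in range(1, max_iter + 1):
        J = jac(x)
        # Linearized constraint around x: J (x_new - x) = target - f(x).
        rhs = target - f(x) + J @ (x - x0)
        lam = np.linalg.solve(J @ H_inv @ J.T, rhs)   # Lagrange multipliers
        x_new = x0 + H_inv @ J.T @ lam                # minimum-cost update
        step = np.max(np.abs(x_new - x))              # max-norm of the update
        x = x_new
        if step < tol:
            return x, it
    return x, max_iter

# Toy mapping from N = 5 articulatory to M = 3 acoustic variables.
A = np.array([[1.0, 0.0, 0.0, 0.5, 0.0],
              [0.0, 1.0, 0.0, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.2, 0.2]])

def f(x):
    return A @ x + 0.05 * x[:3] ** 2              # mildly nonlinear

def jac(x):
    J = A.copy()
    J[:, :3] += np.diag(0.1 * x[:3])
    return J

target = np.array([0.3, -0.2, 0.1])
x_hat, n_iter = min_cost_inverse(f, jac, target, np.zeros(5),
                                 np.diag([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Of the N - M = 2 dimensions left free by the acoustic constraint, the quadratic cost picks a single point, which is how the one-to-many ambiguity is resolved in this sketch.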
In Chapter 5 the method is tested and the results obtained are analyzed. The
258 area vectors of the corpus that correspond to oral vowels were used for this
purpose. Isolated articulatory positions were analyzed first. In this case, continuity
constraints are not imposed. Using the numerical procedure described in Chapter 2,
the first three formant frequencies were extracted from the transfer functions of each
area vector under analysis. These sets of formant frequencies were then used to
estimate the area vectors from which they had been extracted. A comparison of the
original area vectors with those estimated from the formant frequencies shows that the
inversion procedure works satisfactorily for most of the analyzed cases. Nevertheless,
despite the good agreement observed between practically all the acoustic transfer
functions derived from original and estimated vectors, large articulatory distortions
were observed in some cases. Part of these distortions was considerably reduced
when continuity constraints were imposed, i.e. when articulatory trajectories were
estimated instead of isolated articulatory positions. The distortions that remained
can be mainly attributed to the quadratic cost function adopted to represent the
vocal-tract articulatory effort. The quadratic cost allows an efficient mathematical
solution, but does not have a physiological meaning.
Still within the scope of Chapter 5, it is interesting to note that the obtained
transfer functions were derived only from the formant frequencies, and that a good
spectral matching was nevertheless obtained. It is therefore possible to say that, if
morphological information is available, the vocal-tract transfer function can be
estimated from the formant frequencies. It remains to be shown whether humans make
use of such redundancy and, if so, in what way.

Chapter 6 concludes the study. In summary, it is possible to say that the use
of morphological, positional, and continuity constraints can be efficiently combined
in the analysis of the acoustic-to-articulatory mapping during speech. A method
to combine these constraints was described and tested. Correlation coefficients of
0.83 were found in the articulatory domain. In the acoustic domain, the correlation
coefficients found were above 0.999, confirming that the acoustic constraints were
respected.
As a final note about the future, there is still a lot to be done to improve the
model. A physiologically meaningful function to measure the cost of vocal-tract
positions must be obtained to reduce the discrepancies observed. Also important is
the development of a better representation of the acoustic space, so that the model
can be applied to sounds other than oral vowels.

Acknowledgements
This work would not have been completed without the support received from several people.
First of all, I would like to thank Prof. Fumitada Itakura, my PhD advisor, for
guiding me through the PhD course, for teaching me many important things, and for
giving me the freedom necessary to learn several more at Nagoya University.
I would also like to thank Prof. Noboru Ohnishi for analyzing and discussing the
contents of this study. In the same way I thank Prof. Kazuya Takeda for interesting
comments and suggestions about this work, and for helping me with important
problems that I would not have been able to solve alone.
At Nagoya University I very much appreciated studying together with Shoji Kajita,
who was always by my side during the PhD program, helped me every time I needed it,
and shared with me all the happy and difficult moments of these years. I would also like to
thank Hong Wang, now at PictureTel, who was always a good friend during the
years we were together in Nagoya; and Motohiko Yada, from whom I always received
all possible cooperation and encouragement.
I cannot forget the guidance received previously from Osvaldo Catsumi Imamura
and Prof. Fernando Toshinori Sakane, my MSc advisors, and from Prof. Marcos
Botelho, my personal counselor, who taught me how to learn during the time I was
at the Instituto Tecnologico de Aeronautica in Brazil.
Also fundamental for this work was the support received from Shinji Maeda and
Rafael Laboissiere, from whom I received most of the data used as base for this
research, and with whom I had several fruitful discussions.
I am grateful for having had the opportunity to carry out experiments at NTT
Basic Research Laboratories in Atsugi, Japan. There I had the chance to work
with several good scientists (and persons). In particular, I would like to thank Masaaki
Honda, Tokihiko Kaburagi and Takemi Mochida for their perfect collaboration during
the time I was at NTT.
Currently, I am with ATR-HIP (in Kyoto, Japan), where I have found an incredibly
fertile environment for scientific studies. In particular, I would like to thank Yoh'ichi
Tohkura for his constant support and encouragement, Eric Bateson and Mark Tiede
for the several discussions we had together and for the cooperation on the development
of our research goals, Tatsuya Nomura and Shinobu Masaki for translating
the abstract of the thesis into Japanese, and Erik McDermott for sweating over his
dissertation at the same time I was sweating over mine.
I cannot forget also my friends in Brazil, Japan, and all over the world, who were
always ready to collaborate when I asked, and even when I did not ask. Friendship
is undoubtedly the most precious thing existent in the world. I hope we have the
opportunity to help each other many other times in the future.
Finally, but most importantly, I am extremely grateful to all members of my
family for all the guidance, encouragement and support that they have given me
all through my life. In particular, I thank my wife, Ana Helena da Costa Fragoso Yehia,
for the years of my life that she made happier; my parents, Camille Hani Yehia
and Tamira Hamdan Yehia, for their unconditional encouragement since I was born;
and my sister Aline, my brother Salim, and my aunt Badiah, who were also by my
side during all these years.
I apologize for not citing explicitly all the names I wanted to, including the scientists
whose works formed the theoretical base for this study. I hope I can return at
least partially all the things I have received through these years.

List of Symbols
x Distance from the glottis along the tract.
A(x) Cross-sectional area function.
d(x) Midsagittal distance.
α(x) Proportional coefficient used in the αβ model.
β(x) Exponential coefficient used in the αβ model.
y_i Log-area vector of frame i.
A_ki Area of the k-th uniform section used to approximate the area function of frame i.
L_i Vocal-tract length of frame i.
y_ki Natural logarithm of A_ki, k = 1, ..., K. y_Ki = L_i.
P Number of vectors present in the corpus.
K Number of uniform sections used to approximate the area function.
P Amplitude of the sound pressure.
U Amplitude of the volume velocity.
ω Angular frequency of the sound pressure and volume velocity.
c Velocity of sound inside the vocal-tract.
ρ Density of air inside the vocal-tract.
λ_m m-th eigenvalue of Webster's horn equation.
F_m m-th formant frequency.
a Effective lip radius.
ω_m Angular frequency of the m-th formant for a tract with yielding walls.
ω̂_m Angular frequency of the m-th formant for a tract with rigid walls.
ω_0 Lowest angular resonance frequency when the tract is closed at both ends.
L_w Mass of the tract walls per unit length.
C_w Compliance of the tract walls per unit length.
R_w Resistance of the tract walls per unit length.
a Ratio of wall resistance to mass.
b Squared angular frequency of the mechanical resonance.
c_1 Correction for thermal conductivity and viscosity.
l Section length.
M Number of formants used as acoustic parameters.
N Number of articulatory components.
a_n n-th coefficient of the Fourier cosine series expansion of the area function.
a Vector of Fourier cosine coefficients.
U_ay Matrix of cosines used to transform a into y.
C_y Covariance matrix of log-area vectors.
y Mean log-area vector.
S Diagonal matrix containing the eigenvalues of C_y in decreasing order.
U Unitary matrix whose columns contain the normalized eigenvectors of C_y.
Vector of principal component articulatory coefficients.
Mean of the Q vectors used to "fill" the articulatory space.
U y Matrix used to transform into y.
Vector of independent component articulatory coefficients.
T Matrix used to transform into .
f Vector formed by the first three formant log-frequencies associated with a given area function.
f Mean of the Q vectors used to "fill" the acoustic space.
g Vector of independent component acoustic coefficients.
T_fg Matrix used to transform f into g.
T g Matrix that approximates g as a linear transformation of .
G Matrix containing an ensemble of vectors g.
B Matrix containing an ensemble of vectors .
Q Number of vectors contained in the ensembles G and B.
S_h "Pseudo-diagonal" matrix containing the eigenvalues of T g T g^t.
U_hg Unitary matrix containing the normalized eigenvectors of T g T g^t.
U Unitary matrix containing the normalized eigenvectors of T g^t T g.
Vector of articulatory variables used in the principal component representation.
h Vector of acoustic variables used in the principal component representation.

Ty Matrix used to transform y into .
0y Mean of the Q log-area vectors used to "fill" the articulatory space.
T_fh Matrix used to transform f into h.
R_h Matrix of correlation coefficients between h and .
R_f Matrix of correlation coefficients between f and .
H Matrix containing an ensemble of Q acoustic vectors h.
Matrix containing an ensemble of Q articulatory vectors .
h Column vector containing the standard deviation of h.
Column vector containing the standard deviation of .
p Number of frames contained in a given articulatory trajectory.
q Number of coefficient vectors necessary to parametrize an articulatory trajectory of p frames.
i Matrix containing a sequence of p articulatory vectors, starting at frame i.
C "Covariance" matrix of .
P_0 Number of sequences of length p contained in the corpus.
Diagonal matrix containing the eigenvalues of C .
V Matrix whose columns are the normalized eigenvectors of C .
i Principal component representation of .
V Matrix containing the first q columns of V.
Y_i Matrix containing a sequence of p log-area vectors, starting at frame i.
P Quadratic positional cost of a given articulatory vector .
P_a Quadratic positional cost of a given articulatory vector a.
P_y Quadratic positional cost of a given log-area vector y.
H Weight matrix containing information about the morphology of the vocal-tract.
(Principal component representation case.)
H_a Weight matrix containing information about the morphology of the vocal-tract.
(Fourier representation case.)
H_y Weight matrix containing information about the morphology of the vocal-tract.
(Log-area representation case.)
C Covariance matrix of the articulatory vectors .
C_P Average of the costs of all articulatory vectors in the corpus.
h Variation in the acoustic vector h.
Variation in the articulatory vector .

a Jacobian matrix defined by dh/d .
1 Vector containing the first M components of .
2 Vector containing the last N − M components of .
1 Jacobian matrix defined by ∂h/∂ 1.
2 Jacobian matrix defined by ∂h/∂ 2.
0 Articulatory vector corresponding to the neutral position of the vocal-tract.
0 d /d 2 .
I_(N−M) Identity matrix of order N − M.
p  0 t H .
Matrix formed by the combination of a and p .
h Vector of variations containing acoustic and minimum effort targets.
‖·‖ Norm function defined by the largest component (absolute value) of a vector.
M "Matrix of matrices" containing the locally linear relation between a sequence
of articulatory variations and their acoustic and cost counterparts.
G Sequence of articulatory variations i vertically arranged as a column vector.
H Sequence of vectors h_i, vertically arranged as a column vector.
V Np × Nq matrix whose nonzero elements v_ij are the entries of V .
Each row of V is repeated N times.
X Columns of a variation i rearranged in a column vector.
E Np × 1 column vector containing the approximation error between M V X and H.
The squared error E^t H E.
H Weight matrix used in the computation of the squared error .
0 Neutral trajectory determined by the vocal-tract sustaining the neutral
position along the analyzed interval.

Contents
Abstract
Acknowledgements
List of Symbols
1 Introduction
2 Area Function and Formant Frequencies
2.1 Corpus
2.1.1 Sampling vocal-tract profiles
2.1.2 From profiles to midsagittal distances
2.1.3 From midsagittal distances to area function
2.1.4 Log-area function
2.2 Computation of the formant frequencies
2.2.1 Lossless tube
2.2.2 Lossy tube
2.2.3 Numerical determination of formant frequencies
2.2.4 Comparison of lossless and lossy models
3 Parametric Models for the Area Function
3.1 Fourier Analysis
3.1.1 Truncation Effects
3.1.2 Formant frequencies as functions of Fourier coefficients
3.2 Statistical Analysis
3.2.1 Principal Component Analysis
3.2.2 Independent component analysis
3.2.3 Singular Value Decomposition
3.3 Temporal Analysis
4 The Inverse Problem
4.1 Isolated frames
4.1.1 Representing Morphological Constraints
4.1.2 Solving the inverse problem
4.2 Trajectories
5 Results and Discussion
5.1 Isolated Frames
5.2 Trajectories
5.3 Quantitative Analysis
6 Conclusion
A Numeric Information
B Results for Isolated Frames
Bibliography
List of Publications

List of Tables
2.1 Sentences contained in the corpus
5.1 Numerical Results: Areas
5.2 Numerical Results: Length
5.3 Numerical Results: Formants

List of Figures
2.1 Spectral properties of the speech signal determined by different properties
of the vocal apparatus. Only the vocal-tract shape has major influence on
the formant frequencies.

2.2 Left: midsagittal profile tracing extracted from cineradiographic frame and
labial tracing extracted from video frame recorded synchronously. Center:
points sampled from the grid and from the labial region to represent the
profile. Right: Profile approximated by sampled points.

2.3 Procedure used to determine the midsagittal distance and the length of a
given section. The section length is determined by the distance between
the intersections of the bisector of the angle formed by the lines defined
by the section walls with the grid lines that delimit the section. The
midsagittal distance is defined as the distance between the intersections of
the line passing through the midpoint of the segment that determines the
section length, and orthogonal to it, with the section walls. Section walls
and grid lines are represented by thick and thin solid lines, respectively. In
the case shown on the left, the grid lines are part of a Cartesian region of
the grid, whereas in the case shown on the right, the grid lines belong to
the polar region of the grid.

2.4 Top: midsagittal distances of the profile shown in Fig. 2.2 plotted as a
function of the distance from the glottis. Center: area function estimated
with the αβ model. Bottom: Log-area function sampled at uniformly
spaced points along the vocal-tract.
2.5 Comparison between vocal-tract transfer function computed from the area
function estimated from midsagittal distances, and spectral envelope of
the speech recorded synchronously. For the example shown on the left, the
spectral envelope obtained from the pre-emphasized speech signal (gray
line) has its formants (peak frequencies) matching fairly well the formants
of the transfer function derived from the area function (black line). However,
in the case shown on the right, due to the relatively low energy of
the speech signal, the colored noise generated by the experimental apparatus
produces a spurious peak around 1.3 kHz, considerably affecting the
estimated spectral envelope.
2.6 On each column the second panel from the top shows a comparison of
the transfer functions estimated from lossy (black solid line) and lossless
(dashed line) models of the vocal-tract area function. The gray line represents
the speech power spectrum envelope. The speech signal, area
function, midsagittal distance and vocal-tract profile are also given as references.
Note that there is little discrepancy between lossy and lossless
formant frequencies.
2.7 Formant frequency variation due to yielding walls and radiation load. Each
chart shows a histogram of the relative deviation of formants derived from
a lossless vocal-tract model with respect to formants derived from a lossy
model: (F_lossless − F_lossy) / F_lossy. The ordinates show the percentage of
points in the corpus whose relative deviation falls within the abscissa
interval under a given bin.
3.1 Area function for the French /i/, and the same area approximated by
truncated Fourier cosine series expansions (thick lines), compared with
the original area (thin lines).
3.2 Formant frequencies as a function of the number of Fourier cosine coefficients
used to represent the area function of the French /i/. (Note
that N includes the 0th-order term.)
3.3 First three formants as functions of the first 6 Fourier cosine coefficients.
In each graph, all other coefficients are kept equal to zero. A histogram of
the coefficients of the log-area functions of the corpus analyzed is plotted
above each chart.
3.4 First and second formant frequencies as functions of first and third Fourier cosine coefficients a1 and a3, when all other coefficients are equal to zero.
3.5 Eigenvalues of the log-area covariance matrix.
3.6 Eigenvectors corresponding to the first 5 eigenvalues obtained from the decomposition of the log-area covariance matrix. All eigenvectors are normalized to have unit Euclidean norm. The first K = 32 components correspond to the log-area along the tract, and the last component corresponds to the tract length. The corresponding eigenvalue square root is given as a reference to the "importance" of each eigenvector.
3.7 Area function approximations by Fourier cosine series expansion (dashed line), and by statistically optimum eigenvalue expansion (solid line). The thick solid line shows the original area. Above: expansion with 3 components. Below: expansion with 5 components.
3.8 Vocal-tract length, lip area, and alveopalatal area trajectories along the sentence (in French): Ma chemise est roussie. The dashed lines show the original measured trajectories, while the solid lines show the trajectories parametrized by the model proposed here. For each case, the mean and the standard deviation values of the relative difference (in percentage) between parametrized and original trajectories are also shown.
3.9 (a) Parametric subspace determined by the first two components of . (b) Points corresponding to realistic area functions (articulatory space). (c) The same points shown in a coordinate system with "less dependent" components.
3.10 (a) Normalized histograms of the first 3 formant log-frequencies corresponding to an articulatory space filled with approximately uniformly distributed points. (b) Histograms of the variables obtained after independent component analysis (ICA) of the formant log-frequencies. (c) and (d) Scatter plots of the first 2 variables shown in (a) and (b), respectively.
3.11 Basis vectors (rows of Ty ) and mean vector (y ) used to represent log-area vectors (y). Units: the first K = 32 components are log-areas along the tract expressed in log(cm²). The last component is the tract length expressed in normalized units (1 unit = 0.53cm for the basis vectors and 1 unit = 200.53cm for the mean vector).
3.12 Scatterings representing the joint distributions of the components of the acoustic variable h and the components of the articulatory variable . Note the high correlation between 1 and h1, and between 2 and h2. See also the nonlinear relation between 1 and h3.
3.13 First two acoustic components (h1 and h2) expressed as functions of the first two articulatory components ( 1 and 2 ), when all other components ( 3 ; 4 ; and 5 ) are equal to zero. Note that h1 is almost independent of 2 , and that there are one-to-one relationships between h1 and 1 , and between h2 and 2 .
3.14 Articulatory and acoustic component trajectories along the sentence (in French): Ma chemise est roussie. Note the similarity between the first two articulatory trajectories and the first two acoustic trajectories. (The dashed lines in the acoustic trajectories indicate the intervals where the formants cannot be reliably extracted from the speech signal due to very narrow constrictions in the area function.)
3.15 Eigenvalues of the "covariance" matrix  of sequences of parametrized log-area vectors, and corresponding first four eigenvectors.
3.16 (a) Sequence of area functions, taken from the corpus, corresponding to the diphthong /ui/, uttered in the (French) sentence "Louis pense à ça." (b) Sequence of areas reconstructed from the parametric principal component representation of the original areas shown in (a). (c) Sequence of areas reconstructed from the parametric Fourier representation of the original areas shown in (a). (d), (e) and (f) show formant frequency trajectories corresponding to the sequences of areas shown in (a), (b) and (c) respectively. The dashed lines shown in (e) and (f) are the original formant trajectories shown in (d). For each pair of formant trajectories, the maximum relative difference (in percentage) is also shown.
4.1 Representation of the one-to-one relationship between the N − M dimensional subspaces that form the N dimensional articulatory space. Compare with figures 4.2 and 4.3, where the level curves are N − M = 1 dimensional subspaces (contained in an N = 2 dimensional articulatory space) which map onto an M = 1 dimensional acoustic space.
4.2 Top left: the first formant F1 as a function of the Fourier cosine coefficients a1 and a2, when all other coefficients are equal to zero. Top right: paraboloidal surface representing the cost function P_a = a^t H_a a used to quantify the vocal-tract effort. Bottom: The solid thick lines show level curves of the surface shown in the top left panel. (Compare with the general case in Fig. 4.1.) The solid thin ellipses show level curves of the cost function shown in the top right panel. The dashed circles represent the particular case when P_y is an unweighted squared Euclidean distance (i.e. when H_y is an identity matrix).
4.3 Top left: the first acoustic component h1 as a function of the principal component coefficients 1 and 2 , when all other coefficients are equal to zero. Top right: paraboloidal surface representing the cost function P = tH used to quantify the vocal-tract effort. Bottom: The solid thick lines show level curves of the surface shown in the top left panel. (Compare with the general case in Fig. 4.1.) The solid thin ellipses show level curves of the cost function shown in the top right panel. The dashed ellipses represent the particular case when P_y is an unweighted squared Euclidean distance (i.e. when H_y is an identity matrix).
5.1 Results obtained with the inversion technique for isolated frames. In each column, the central panel shows original (thin line) and estimated area (thick line). The estimated area is obtained from the formant frequencies determined by the original area. Vocal-tract profile, midsagittal distances, transfer functions and speech signal are also shown for reference purposes. From left to right the columns correspond to the neutral, and French /a/, /i/, and /u/ vowels.
5.2 Problems with the inversion procedure. The columns show the following cases. Left: French /a/ with excessively open lips. Center-left: French /u/ with excessively large front cavity. Center-right: French /i/ with excessively short length. Right: French /e/ with excessively closed lips compensating underestimated length. As in Fig. 5.1, in the central panel of each column, the thin line is the original area and the thick line is the area estimated from the formant frequencies determined by the original area.
5.3 Top: Sequence of area functions, taken from the corpus, corresponding to the diphthong /ui/, uttered in the French sentence "Louis pense à ça." Bottom: Formant frequency trajectories corresponding to the sequences of areas shown in the top panel.
5.4 Top: Sequence of areas estimated from the formant trajectories shown in the bottom panel of Fig. 5.3, under continuity and minimum effort constraints. Bottom: The solid lines show the formant frequency trajectories corresponding to the sequences of areas shown in the top panel. The dashed lines reproduce the original formant trajectories shown in the bottom panel of Fig. 5.3.
5.5 Top: Sequence of area functions estimated from the formant trajectories shown in Fig. 5.3 under the following constraint: The areas are represented by the first six components of its Fourier cosine series expansion with the even coefficients set to zero. Bottom: The solid lines show the formant frequency trajectories corresponding to the sequences of areas shown in the top panel. The dashed lines reproduce the original formant trajectories shown in the bottom panel of Fig. 5.3.
5.6 Top: Sequence of area functions estimated from the formant trajectories shown in Fig. 5.3 under the following constraint: The areas are represented by the first nine components of its Fourier cosine series expansion determined under morphological and continuity constraints. Bottom: The solid lines show the formant frequency trajectories corresponding to the sequences of areas shown in the top panel. The dashed lines reproduce the original formant trajectories shown in the bottom panel of Fig. 5.3.
5.7 (a) Scattering of the cross-sectional areas obtained from the parametric principal component representation of the original areas plotted against their original counterparts. The scattering of the formant frequencies derived from the areas is shown in (e). (b) Cross-sectional areas estimated from formant vectors in the case of isolated frames plotted against original areas. The formant frequencies derived from the areas are plotted in (f). (c) Cross-sectional areas estimated from formant vector trajectories plotted against original areas. The formant frequencies derived from the areas are plotted in (g). (d) Cross-sectional areas estimated from isolated frames plotted against areas estimated from formant vector trajectories. The formant frequencies derived from the areas are plotted in (h). The correlation coefficients are given in the top right corner. The 258 oral vowel frames available in the corpus were used to generate the scatterings.

Chapter 1
Introduction
"A voice cannot carry the tongue
and the lips that gave it wings.
Alone must it seek the ether."
Gibran Kahlil Gibran (1883–1931)
The Prophet

The speech production process is the result of the combination of articulatory movements acting on and interacting with the air flowing from the lungs to the mouth. The part of the acoustic effects of such actions and interactions that is radiated (mainly) through the lips and nostrils constitutes the speech signal (Flanagan, 1972).
An efficient representation of speech is fundamental for obtaining good performance in applications such as speech synthesis, coding, and recognition (Rabiner and Schafer, 1978). While parametrization of the speech waveform itself is often sufficient, it does not explicitly give speech features like spectral envelope and pitch information that are important, for example, in speech recognition systems (Rabiner and Juang, 1993). Moreover, waveform parametrization allows only a limited degree of compression of the speech signal (Jayant and Noll, 1984).

Higher levels of compression, as well as a clearer representation of meaningful speech acoustic parameters, can be achieved by modeling the physical process that generates the speech signal. Today, most of the methods following this line are based on linear prediction theory (Markel and Gray, 1976). Such methods are very efficient in parametrizing short intervals (frames) of speech. However, since speech acoustic parameters can vary abruptly, smoothness constraints cannot, in general, be imposed. If such constraints were possible, they could be invoked to improve the accuracy of parameter estimation (especially under adverse conditions), to attain even higher compression levels, and to simplify models for speech synthesis.
Although, in general, smoothness constraints cannot be imposed on acoustic parameters, they can be used successfully to characterize vocal-tract articulatory movements (Perkell, 1969). Therefore, assuming that articulatory synthesis (synthesis of speech based on vocal-tract configuration parameters) is a goal that can, in principle, be achieved (Maeda, 1982; Sondhi and Schroeter, 1987; Scully, 1990; Lin, 1990), articulatory parameters can be very useful for an efficient representation of the speech process (Flanagan, 1972).
If speech is to be represented by articulatory parameters, then, besides developing methods to generate speech from such parameters (the direct problem), it is necessary to be able to estimate the vocal-tract configuration from the speech signal (the inverse problem). This includes determination of subglottal and glottal conditions (voice source), vocal-tract shape and losses, and radiation load. This study focuses on the estimation of the vocal-tract shape, which is the primary determinant of the formant structure of the speech signal (Fant, 1980).
The extraction of geometrical characteristics of the vocal-tract from its acoustic features has been discussed in several previous studies. Schroeder (1967) analytically described the relationship between the singularities (poles and zeros) of the vocal-tract admittance measured at the lips and the vocal-tract cross-sectional log-area function (represented by its Fourier cosine series expansion). The analysis was performed for variations within the limit of applicability of first order perturbation theory. For larger variations, Mermelstein (1967) developed a numerical procedure to estimate the area function (parametrized by the first 6 coefficients of its Fourier cosine series expansion) from the admittance singularities. He showed that the formant frequencies, which correspond to the admittance poles, are not sufficient to uniquely determine the log-area function. The remaining necessary information can be obtained from the admittance zeros but, unfortunately, these cannot be estimated from the speech signal. Schroeder (1967) then developed an experimental apparatus to measure the vocal-tract admittance at the lips and, using a frequency domain approach, was able to determine good approximations for the area function. However, the problem of estimating the area function from the speech signal still remained to be solved.
With the advent of linear prediction theory applied to speech (Itakura and Saito, 1968; Atal and Hanauer, 1971), Wakita (1973, 1979) developed an inverse filtering technique to estimate the vocal-tract area function from the speech waveform. However, that technique makes use of information about voice source, loss distribution, tract length, and lip radiation that cannot be assumed to be accurately known a priori. In fact, Sondhi (1979) showed that the speech signal alone does not contain enough information for a unique determination of the vocal-tract area function, confirming the conclusions of Mermelstein (1967) and Schroeder (1967).
Thus, on the one hand, in order to achieve a practical system of articulatory speech representation, it is necessary to obtain the vocal-tract shape from the speech signal. On the other hand, the speech signal itself does not contain enough information to uniquely determine such a shape. Therefore, it is necessary to constrain the universe of possible tract configurations, so that the problem can be efficiently solved. Since the speech signal is assumed to be produced by a human vocal-tract, human physiology is a natural source of constraints. In other words, vocal-tract data obtained from acoustic (Sondhi and Resnick, 1983; Yehia et al., 1995a, 1995b; Yehia and Itakura, 1995a), X-ray (Bothorel et al., 1986), magnetic resonance imaging (MRI) (Baer et al., 1991; Tiede et al., 1996; Tiede and Yehia, 1996; Yehia and Tiede, 1997), electromagnetic midsagittal articulometer (EMMA) (Perkell et al., 1992; Yehia et al., 1996), or any other kind of tract measurement can be used (directly or indirectly) as prior information for the vocal-tract shape estimation from the speech signal.
Following this line, various frameworks have been formulated to combine the acoustic information contained in the speech signal with the constraints determined by the human physiology. A computer sorting technique followed by a fine optimization procedure was used by Atal et al. (1978) and, in a more elaborate model, by Schroeter and Sondhi (1991). Model matching techniques were used by Flanagan et al. (1980) and by Shirai and Kobayashi (1986). Shirai (1993) also proposed a neural network approach for the estimation of articulatory motion. Another connectionist approach, making use of a control theory framework (whose principle was first proposed by Jordan, 1990) was presented by Bailly et al. (1991). As a final example, McGowan (1994) considered the use of genetic algorithms, obtaining interesting results for the dynamic case of the inverse problem. (A much more complete description of the techniques developed to estimate vocal-tract shapes from the speech signal can be found in Schroeter and Sondhi, 1994.)
The approaches cited above have two points in common. The first is that, during the optimization procedure, acoustic and shape parameters are represented in distinct spaces. This fact, besides resulting in a large number of optimization parameters, often leads to a cost function with local minima (Schroeter and Sondhi, 1991). The second is that an explicit articulatory model is always used. The problem here is that the design of articulatory models is normally oriented toward the direct problem of speech production (Coker and Fujimura, 1966; Mermelstein, 1973; Coker, 1976; Maeda, 1990). Although such models can be successfully used in the inverse problem, problems of redundancy and ambiguity may occur (Gupta and Schroeter, 1993).
Within the present study, these two problems are avoided: the first by mapping acoustic and shape parameters into a common space, and the second by using only the statistical behavior of the vocal-tract, rather than an explicit articulatory model, to formulate a cost function (Yehia and Itakura, 1996; Yehia, Takeda and Itakura, 1996).
In the case of the method proposed here, which can be included in the model matching category, the following series of steps must be carried out:
• Acquisition of vocal-tract morphological, dynamic and acoustic information.
• Parametrization of the articulatory and acoustic spaces.
• Representation of the mapping from the articulatory space onto the acoustic space.
• Formulation of a cost function which quantifies morphological and dynamic constraints.
• Combination of acoustic, morphological and dynamic information to solve the inverse problem.
These topics are addressed one by one in the following chapters. The main point is the mapping from the articulatory onto the acoustic space. These spaces are appropriately represented such that the resulting mapping is simple enough to support a one-to-one approximately linear relationship between the components of a subspace of the articulatory space and the corresponding components of the acoustic space. This fact is then exploited to find a plausible solution for the restricted case of the inverse problem considered here. The results obtained are then evaluated and interpreted.
Chapter 2
Area Function and
Formant Frequencies
"Everything flows."
Heraclitus (535–475 BC)
On Nature
The area function and formant frequencies play an important role in the study of speech production: they form a bridge between articulatory configurations of the vocal-tract and acoustic characteristics of speech. Formant frequencies are primarily determined by the vocal-tract shape, with little influence from other articulatory factors (Flanagan, 1972, pp. 58–69). This is in contrast with other spectral properties of the speech signal, over which factors other than the vocal-tract shape can also have considerable influence (see Fig. 2.1).
For the case of sound plane wave propagation, the cross-sectional area along the
vocal-tract (the area function) is the geometric property of the vocal-tract shape that
determines the formant frequencies. But the converse does not hold; that is, the
formant frequencies do not in turn uniquely determine the area function.
In this chapter, the relationship between area function and formant frequencies
will be analyzed. The objective is to determine the amount of information about the
area function contained in the formant frequencies, and to characterize the mapping
between the spaces formed by the area function and by the formant frequencies. These
pieces of information are important in obtaining a consistent base to solve the inverse
problem.

[Figure 2.1 diagram: the spectral properties of the speech signal (formant frequencies, formant bandwidths, spectral tilt, harmonic structure, energy) shown against the properties of the vocal apparatus that determine them (vocal-tract shape, wall and viscous losses, radiation load, excitation pulse).]

Figure 2.1: Spectral properties of the speech signal determined by different properties of the vocal apparatus. Only the vocal-tract shape has major influence on the formant frequencies.

2.1 Corpus
The corpus used in this study consists of cineradiographic data described in Bothorel et al. (1986). (More details about the capabilities and limitations inherent to X-ray measurements of the vocal-tract can be found in Fant (1970), and in Chiba and Kajiyama (1941).) The procedure used to estimate the area functions from the cineradiography is described here. The analysis starts from digital tracings of midsagittal profiles corresponding to ten French sentences (listed in Table 2.1) as spoken by a female subject (PB), acquired at a rate of 50 frames per second. The corresponding labiograms were acquired simultaneously. A sample is shown in Figure 2.2.
Table 2.1: Sentences contained in the corpus
Ma chemise est roussie.
Voilà des bougies.
Donne un petit coup.
Une réponse ambiguë.
Louis pense à ça.
Mets tes beaux habits.
Une pâte à choux.
Prête-lui seize écus.
Chevalier du gué.
Il fume son tabac.

2.1.1 Sampling vocal-tract profiles

In order to convert the original profiles into area functions, a sequence of transformations is necessary. First, each midsagittal profile is plotted on a semipolar grid, using the hard palate as reference (Heinz and Stevens, 1964; Maeda, 1990). The grid lines are spaced by 0.5 cm in the linear regions, and by 11 degrees in the polar region. Lip and larynx regions are specified in a special manner. The lips are modeled by a uniform elliptical tube whose shape is determined from the labiogram, and whose length (protrusion) is defined as the distance from the upper incisors to the point of minimum separation between upper and lower lips. The laryngeal cavity is modeled as a trapezoidal tube (with circular cross-section) defined by the two points where the tract profile intersects the sixth line of the semipolar grid, and by the lowest left and right points of the tract profile. This procedure, illustrated in the central panel of Fig. 2.2, allows the representation of all profiles by the same number of points. In the present case, there are 29 pairs of points, each pair containing a point on the anterior wall and a point on the posterior wall of the tract. The approximation of the profile by the segments joining these points is shown in the right panel of Fig. 2.2.

[Figure 2.2 panels, frame PB0146: Digitized Profile, Sampled Points, Regenerated Profile; both axes in cm.]

Figure 2.2: Left: midsagittal profile tracing extracted from cineradiographic frame and labial tracing extracted from video frame recorded synchronously. Center: points sampled from the grid and from the labial region to represent the profile. Right: Profile approximated by sampled points.

2.1.2 From profiles to midsagittal distances

The next step is to represent the set of points plotted on each midsagittal profile by the corresponding midsagittal distances, plotted as a function of the position along the vocal-tract. If the midsagittal distance is interpreted as the distance between the points where an ideal longitudinal sound wavefront propagating in the tract "touches" the anterior and posterior walls of the profile, then an appropriate geometric procedure to represent each section of the vocal-tract by a midsagittal distance and a section length is as follows: each of the 28 sections defined by the 29 pairs of points sampled from a given profile is seen as part of an infinite conical horn (or a cylindrical horn, if the walls are parallel). The direction of propagation of the wavefront in this "horn" is determined by the line that bisects the angle formed by the lines containing the anterior and posterior wall profiles of the section (see Fig. 2.3). The intersection points of this line with the grid lines that define the section determine a segment whose length will be taken as the section length. Finally, the intersection points of the line orthogonal to this segment passing through its midpoint with the anterior and posterior wall profiles determine a segment whose length is taken as the midsagittal distance of the section. The top panel of Figure 2.4 shows the midsagittal distance plotted as a function of the distance from the glottis. The vocal-tract length is taken as the sum of the section lengths of all sections.
This procedure follows the same physical principle adopted in Maeda (1972), but with a different geometrical construction. Alternative procedures can be found, for example, in Fant (1960), Beautemps et al. (1995), and Maeda (1990).
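
The geometric construction described above can be sketched in a few lines of code. The following is only an illustrative reading of the procedure, not the implementation used in this study: the function name `section_geometry`, the point ordering, and the parallelism tolerance `EPS` are assumptions introduced here. Each wall is given by its two sampled points, both ordered from the first grid line of the section toward the second, and no grid line is assumed to be parallel to the propagation axis.

```python
import math

EPS = 1e-12  # assumed tolerance for deciding that two lines are parallel


def _sub(p, q):
    return (p[0] - q[0], p[1] - q[1])


def _unit(v):
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n)


def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def _intersect(p, d, q, e):
    """Intersection of the lines p + s*d and q + t*e, or None if parallel."""
    cross = d[0] * e[1] - d[1] * e[0]
    if abs(cross) < EPS:
        return None
    r = _sub(q, p)
    s = (r[0] * e[1] - r[1] * e[0]) / cross
    return (p[0] + s * d[0], p[1] + s * d[1])


def section_geometry(a1, a2, p1, p2):
    """Return (section_length, midsagittal_distance) for one section.

    a1, a2: anterior-wall points on the first and second grid line.
    p1, p2: posterior-wall points on the same two grid lines.
    """
    u = _unit(_sub(a2, a1))  # direction of the anterior wall
    v = _unit(_sub(p2, p1))  # direction of the posterior wall
    apex = _intersect(a1, u, p1, v)
    if apex is None:
        # Parallel walls (cylindrical horn): the axis is the mid-line.
        origin = ((a1[0] + p1[0]) / 2, (a1[1] + p1[1]) / 2)
        w = u
    else:
        # Conical horn: the axis bisects the angle formed at the apex.
        origin = apex
        w = _unit((u[0] + v[0], u[1] + v[1]))
    # The axis cut by the two grid lines gives the section length.
    g1 = _intersect(origin, w, a1, _sub(p1, a1))
    g2 = _intersect(origin, w, a2, _sub(p2, a2))
    length = _dist(g1, g2)
    # The perpendicular through the midpoint, cut by both walls,
    # gives the midsagittal distance.
    mid = ((g1[0] + g2[0]) / 2, (g1[1] + g2[1]) / 2)
    normal = (-w[1], w[0])
    wall_a = _intersect(mid, normal, a1, u)
    wall_p = _intersect(mid, normal, p1, v)
    return length, _dist(wall_a, wall_p)
```

For a rectangular section 2 cm long and 1 cm wide the sketch returns a section length of 2 cm and a midsagittal distance of 1 cm, as expected; summing the section lengths over all sections would give the vocal-tract length.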

[Figure 2.3 labels: Grid Lines, Sagittal Distances, Section Lengths.]

Figure 2.3: Procedure used to determine the midsagittal distance and the length of a given section. The section length is determined by the distance between the intersections of the bisection of the angle formed by the lines defined by the section walls with the grid lines that delimit the section. The midsagittal distance is defined as the distance between the intersections of the line passing through the midpoint of the segment that determines the section length, and orthogonal to it, with the section walls. Section walls and grid lines are represented by thick and thin solid lines, respectively. In the case shown on the left, the grid lines are part of a Cartesian region of the grid, whereas in the case shown on the right, the grid lines belong to the polar region of the grid.
[Figure 2.4 panels, top to bottom: Midsagittal Distance along the Vocal-Tract (cm), Cross-Sectional Area along the Vocal-Tract (cm²), Log-Area Function along the Vocal-Tract (ln(cm²)); horizontal axis: Distance from Glottis (cm).]

Figure 2.4: Top: midsagittal distances of the profile shown in Fig. 2.2 plotted as a function of the distance from the glottis. Center: area function estimated with the αβ model. Bottom: Log-area function sampled at uniformly spaced points along the vocal-tract.

2.1.3 From midsagittal distances to area function


The midsagittal distances are now transformed into the corresponding cross-sectional areas (central panel of Fig. 2.4), using the "αβ model" originally proposed by Heinz and Stevens (1964):

\[ A(x) = \alpha(x)\, d(x)^{\beta(x)}, \tag{2.1} \]

where $x$ is the distance from the glottis, $A(x)$ and $d(x)$ are respectively the cross-sectional area and the midsagittal distance at $x$, and $\alpha(x)$ and $\beta(x)$ are coefficients determined in an ad hoc manner. The values used here are taken from Maeda (1990).
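
As a minimal sketch of how Eq. (2.1) is applied section by section, the snippet below assumes the power-law form A = α d^β with one coefficient pair per section. The function name `midsagittal_to_area` and the coefficient values in the usage note are placeholders chosen for illustration; they are not the values of Maeda (1990).

```python
def midsagittal_to_area(distances_cm, alpha, beta):
    """Apply A = alpha * d**beta independently to every section.

    distances_cm: midsagittal distances d(x), one per section (cm).
    alpha, beta:  per-section coefficients (illustrative placeholders).
    Returns the cross-sectional areas in cm^2.
    """
    if not (len(distances_cm) == len(alpha) == len(beta)):
        raise ValueError("one (alpha, beta) pair is needed per section")
    return [a * d ** b for d, a, b in zip(distances_cm, alpha, beta)]
```

For instance, `midsagittal_to_area([1.0, 2.0], [1.5, 2.0], [1.4, 1.0])` yields areas of 1.5 and 4.0 cm² for these made-up coefficients. Because the coefficients vary along the tract, the same midsagittal distance can map to different areas in, say, the pharyngeal and palatal regions.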
The problem with such a transformation is that the two-dimensional information contained in the midsagittal profile is not enough to obtain an accurate estimation of the area function. Therefore, except for the lip region, where the labiogram provides the necessary information, there may exist non-negligible discrepancies between the real and the estimated area functions. Even when a more elaborate model, such as those proposed by Perrier et al. (1992) and by Beautemps et al. (1995), is used, it is impossible to eliminate the discrepancies. This is the main reason why, for a given frame, the formants extracted from the speech signal do not match exactly those numerically derived from the estimated area function (see Fig. 2.5). Other reasons for this mismatch are errors in formant measurement from speech, and inaccuracies in the physical model used to calculate the formants from the area function. In order to avoid such discrepancies, the corpus of formant frequencies used in this and in the next chapters consists of the formants numerically derived from the corpus of estimated area functions (see Section 2.2.3), and not of those extracted from the corresponding speech signal. By doing so, it is ensured that the inaccuracies that will appear in the results shown in the next chapter are inherent in the method proposed to solve the inverse problem, and do not depend on the factors above. Admittedly, in order to work with real speech, it is necessary to analyze such factors, but this task will not be carried out here.

2.1.4 Log-area function


Instead of working directly with the cross-sectional area along the tract, each area is
transformed into a log-area vector, as shown in the bottom panel of Fig. 2.4. This
2.1. CORPUS 13

1 Speech Signal PB1549 1 Speech Signal PB1559


0 0
−1 −1
0 5 10 15 20 0 5 10 15 20
Time (ms) Time (ms)
Power Spectrum (dB) PB1549 Power Spectrum (dB) PB1559
40 40

20 20

0 0

0 1 2 3 4 5 0 1 2 3 4 5
Frequency (kHz) Frequency (kHz)
10 10
Area Function PB1549 Area Function PB1559
Area (cm2)

Area (cm2)
5 5

0 0
0 5 10 15 0 5 10 15
Distance from Glottis (cm) Distance from Glottis (cm)
Midsag. Dist. (cm)

Midsag. Dist. (cm)


Midsagittal Distances PB1549 Midsagittal Distances PB1559
4 4

2 2

0 0
0 5 10 15 0 5 10 15
Distance from Glottis (cm) Distance from Glottis (cm)
15 15
Vocal−Tract Profile PB1549 Vocal−Tract Profile PB1559
25 20 25 20
30 30

10 10
15 15
cm

cm

10 10
5 5
5 5

0 0
0 0
0 5 10 15 0 5 10 15
cm cm

Figure 2.5: Comparison between vocal-tract transfer function computed from the area
function estimated from midsagittal distances, and spectral envelope of the speech
recorded synchronously. For the example shown in the left, the spectral envelope ob-
tained from the pre-emphasized speech signal (gray line) has its formants (peak frequen-
cies) matching fairly well the formants of the transfer function derived from the area
function (black line). However, in the case shown in the right, due to the relatively low
energy of the speech signal, the colored noise generated by the experimental apparatus
produces a spurious peak around 1.3kHz, considerably a ecting the estimated spectral
envelope.
14 CHAPTER 2. AREA FUNCTION AND FORMANT FREQUENCIES

transformation not only ensures that the area function is always positive, but
is also more meaningful from an acoustic point of view, since the area ratios, rather
than the area values themselves, determine the formant frequencies (see, for example,
Mermelstein (1967)). The transformation is simply the natural logarithm of the area, with
areas smaller than a given threshold (5 mm$^2$ in this study) being clipped to the
threshold value to avoid numerical problems with closures. Such clipping, however,
does not lead to significant inaccuracy from either an articulatory or an acoustic point
of view. In the case of vowels, it was observed that the minimum area is not less than
25 mm$^2$, and that, even in the case of fricative sounds, the minimum area is not less
than 15 mm$^2$. Basically, areas are smaller than the 5 mm$^2$ threshold only in the
case of closures.
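The clipped logarithmic transformation described above can be sketched in a few lines. This is a minimal illustration, assuming areas are given in cm$^2$ (so the 5 mm$^2$ threshold becomes 0.05 cm$^2$); the function name `log_area` and the sample values are hypothetical and not taken from the corpus.

```python
import numpy as np

AREA_FLOOR_CM2 = 0.05  # 5 mm^2 threshold from the text, expressed in cm^2


def log_area(area_cm2, floor_cm2=AREA_FLOOR_CM2):
    """Natural log of the area, with areas below the floor clipped to the floor."""
    area = np.asarray(area_cm2, dtype=float)
    return np.log(np.maximum(area, floor_cm2))


# Hypothetical cross-sectional areas (cm^2); the zero stands for a closure.
areas = np.array([2.0, 0.0, 0.01, 0.15])
print(log_area(areas))
```

Clipping before taking the logarithm keeps the result finite at closures, where the raw area would otherwise give minus infinity.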
Due to the procedure used to plot the midsagittal profiles on the semipolar grid,
the 29 points shown in the top and central panels of Fig. 2.4 are not evenly spaced.
Even spacing allows a simpler representation of the cross-sectional area along the
tract, since it can then be described by a vector of log-areas plus the vocal-tract
length. For this reason, using linear interpolation, the log-area along the vocal-tract
is resampled so that it can be represented by K = 32 uniform sections, as illustrated
by the stair-step graph shown in the bottom panel of Fig. 2.4.
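The resampling step can be sketched as follows. This is only an illustration under stated assumptions: the measured points are taken to be ordered by increasing distance from the glottis, the first and last points are taken as the glottis and lip ends, and each uniform section is sampled at its center. The function name `resample_log_area` and the sample data are hypothetical.

```python
import numpy as np


def resample_log_area(x_cm, log_area, n_sections=32):
    """Linearly interpolate the log-area at the centers of equal-length sections.

    Returns the resampled log-areas and the tract length.
    """
    x = np.asarray(x_cm, dtype=float)
    y = np.asarray(log_area, dtype=float)
    length = x[-1] - x[0]
    # Section centers at (k - 1/2) * L / K, k = 1..K, measured from the first point.
    centers = x[0] + (np.arange(n_sections) + 0.5) * length / n_sections
    return np.interp(centers, x, y), length


# Hypothetical unevenly spaced measurement points (cm) and log-areas.
x_meas = np.array([0.0, 1.0, 4.0, 10.0])
y_meas = 2.0 * x_meas + 1.0
y_uniform, tract_length = resample_log_area(x_meas, y_meas, n_sections=4)
print(y_uniform, tract_length)
```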
Each log-area function present in the corpus, when approximated by a concatenation
of uniform tubes of equal length, can now be represented by a vector containing
the natural logarithm of the section areas and the tract length. In this study, the
following notation will be used:
$$\mathbf{y}_i = [\, y_{1i} \;\ldots\; y_{Ki} \;\; y_{K+1,i} \,]^t, \qquad i = 1, \ldots, P, \tag{2.2}$$
$$y_{ki} = \ln A_i\!\left( (k - \tfrac{1}{2}) \frac{L_i}{K} \right), \qquad k = 1, \ldots, K,$$
$$y_{K+1,i} = L_i,$$
where $L_i$ is the tract length of frame $i$, expressed in units normalized so that the
variance of $y_{K+1}$ is equal to the largest variance of the first $K$ components of
$\mathbf{y}$; $A_i(x)$ is the cross-sectional area of frame $i$ at distance $x$ from the
glottis; $K = 32$ is the number of uniform sections present in each area function; and
$P = 519$ is the number of vectors present in the corpus.
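The length normalization mentioned above can be illustrated with a short sketch. It assumes the corpus is stored as a P x K array of log-areas plus a vector of P tract lengths, and it uses the unbiased sample variance; the function name `build_corpus_matrix` and the synthetic data are hypothetical.

```python
import numpy as np


def build_corpus_matrix(log_areas, lengths):
    """Stack log-areas (P x K) and rescaled tract lengths into a P x (K+1) matrix.

    The length column is scaled so that its variance equals the largest
    variance found among the K log-area columns.
    """
    log_areas = np.asarray(log_areas, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    target_var = log_areas.var(axis=0, ddof=1).max()
    scale = np.sqrt(target_var / lengths.var(ddof=1))
    return np.column_stack([log_areas, scale * lengths]), scale


# Synthetic stand-in for the corpus (not real data).
rng = np.random.default_rng(0)
demo_log_areas = rng.normal(size=(50, 4)) * np.array([1.0, 2.0, 0.5, 1.5])
demo_lengths = rng.normal(17.0, 1.0, size=50)
Y, scale = build_corpus_matrix(demo_log_areas, demo_lengths)
print(Y.shape, scale)
```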

2.2 Computation of the formant frequencies


In order to study the relationship between speech acoustic and vocal-tract geometric
(articulatory) parameters, the first step is to understand the physical basis of the
process.

2.2.1 Lossless tube


A very simplified model of the vocal-tract is a rigid, lossless tube whose cross-sectional
area varies along its length. If, moreover, sound is assumed to propagate in longitudinal
plane waves, the acoustic pressure inside the tract is governed by Webster's
horn equation (Webster, 1919; Eisner, 1966)
$$\frac{d}{dx}\left[ A(x) \frac{dP}{dx} \right] + \lambda A(x) P = 0, \qquad \lambda = \frac{\omega^2}{c^2}, \tag{2.3}$$
with boundary conditions
$$\left. \frac{dP}{dx} \right|_{x=0} = 0 \tag{2.4}$$
representing a closed glottis, and
$$P(L) = 0 \tag{2.5}$$
representing open lips without radiation load. Here, $x$ is the distance from the glottis
along the tract, $P$ and $\omega$ are respectively the amplitude and the angular frequency of
the sound pressure (for a sinusoidal time dependence), $A(x)$ is the cross-sectional area,
$L$ is the tract length, and $c$ is the velocity of sound in the air (inside the vocal-tract).
The formant frequencies are defined as the resonant frequencies (or eigenfrequencies)
of the tract
$$F_m = \frac{c \sqrt{\lambda_m}}{2\pi}, \tag{2.6}$$
where $\lambda_m$ is the m-th eigenvalue of Eq. (2.3) under the boundary conditions defined
by Eq. (2.4) and Eq. (2.5). Therefore, for a given vocal-tract length $L$ and a given
set of boundary conditions, the formant frequencies are basically determined by the
cross-sectional area function $A(x)$. In fact, rewriting Eq. (2.3) as
$$\frac{d^2 P}{dx^2} + \frac{d}{dx}\left[ \ln A(x) \right] \frac{dP}{dx} + \lambda P = 0, \tag{2.7}$$
it is possible to see that the eigenvalues, and hence the formant frequencies, depend
on the logarithm of the area rather than on the area function itself (as already stated).
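The step from Eq. (2.3) to Eq. (2.7), and the special case of a uniform tube, can be written out explicitly. The worked numbers at the end assume, purely for illustration, a tract length of 17.5 cm and c = 350 m/s; they are not values taken from the corpus.

```latex
% Expand the derivative in Eq. (2.3) and divide by A(x) > 0:
\frac{d^2 P}{dx^2} + \frac{1}{A(x)} \frac{dA}{dx} \frac{dP}{dx} + \lambda P = 0,
\qquad
\frac{1}{A(x)} \frac{dA}{dx} = \frac{d}{dx} \ln A(x).

% Uniform tube: A(x) constant, so the middle term vanishes.
\frac{d^2 P}{dx^2} + \lambda P = 0
\;\Rightarrow\;
P(x) = \cos\!\left( \sqrt{\lambda}\, x \right)
\quad \text{(satisfies Eq. (2.4))}.

% Open-lips condition, Eq. (2.5):
\cos\!\left( \sqrt{\lambda_m}\, L \right) = 0
\;\Rightarrow\;
\sqrt{\lambda_m} = \frac{(2m-1)\pi}{2L}
\;\Rightarrow\;
F_m = \frac{c \sqrt{\lambda_m}}{2\pi} = \frac{(2m-1)\, c}{4L}.

% Illustrative values: c = 350 m/s, L = 0.175 m
F_1 = 500~\mathrm{Hz}, \quad F_2 = 1500~\mathrm{Hz}, \quad F_3 = 2500~\mathrm{Hz}.
```

Scaling the area by a constant leaves the logarithmic derivative unchanged, which is why only area ratios matter for the eigenvalues.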

At this point, it is interesting to return to the works of Schroeder (1967) and
Mermelstein (1967) mentioned in the introduction, and explain them in terms of
Webster's equation. Schroeder (1967) showed, within the limits of first-order
perturbation analysis, that the m-th formant frequency is directly related to, and
only to, the m-th odd coefficient of the Fourier cosine series of the log-area function.
So, even if the whole (infinite) set of formant frequencies were known, only "half
of the information" (the odd terms) necessary to determine the area function would be
available. The remaining information could be obtained from the complete set of
eigenvalues obtained under the boundary conditions
$$\left. \frac{dP}{dx} \right|_{x=0} = 0 \quad \text{and} \quad \left. \frac{dP}{dx} \right|_{x=L} = 0, \tag{2.8}$$
representing a closed glottis and closed lips, respectively. Under these conditions, to
first order in perturbation theory, the m-th eigenvalue is linearly related to, and only
to, the m-th even coefficient of the Fourier cosine series of the log-area function.
Mermelstein (1967) then verified experimentally that, even for larger variations, the
(finite) set composed of the first 2M coefficients of the Fourier cosine series expansion
of the log-area function has a one-to-one relationship with the set composed of the first
M eigenvalues obtained under the closed glottis and open lips condition, together with
the first M eigenvalues obtained under the closed glottis and closed lips condition.
However, as already stated in the introduction, the latter set of eigenvalues, which
corresponds to the admittance zeros at the lips, cannot be obtained from the speech
signal.

2.2.2 Lossy tube


Webster's equation describes the relationship between formant frequencies and
the log-area function for a highly simplified vocal-tract model. When a more realistic
model is considered, factors like nonplanar wave propagation, viscous and thermal
losses, glottal impedance, radiation load, and yielding walls must be taken into account.
Nevertheless, it is interesting to note that, although all those factors do affect
the spectrum of speech produced by the tract, most of them have almost no influence
on the formant frequencies. Moreover, the factors that influence the formant frequencies
can have their effects approximately compensated for by simple transformations.

Briefly examining these factors one by one, it is possible to say that: (i) considering
that the cross-sectional dimensions of the tract normally do not exceed 4 or 5 cm, and
since $c \simeq 350$ m/s inside the tract, there are no transversal resonance modes below
about 3.5 to 4 kHz. Therefore, at least for the first 3 formant frequencies (which are
normally below 3.5 kHz), the plane wave propagation assumption is valid (Sondhi,
1974). (ii) Viscous and thermal losses do affect the formant bandwidths, but have
little effect on the formant frequencies (Flanagan, 1972, pp. 58-61). (iii) The glottal
boundary condition has strong influence on the spectral tilt and on the lower formant
bandwidths, but its effect on the formant frequencies is of the order of 1% (for voiced
speech). Thus, if only the formant frequencies are to be considered, the closed glottis
is a good approximation for the glottal boundary condition, at least in the case of
vowels (Flanagan, 1972, pp. 63-65). (iv) The approximate effect of the radiation load
at the lips on the formant frequencies is to lower them by a factor of
$$\frac{3\pi L}{3\pi L + 8a}, \qquad a = \sqrt{\frac{A(L)}{\pi}},$$
where $a$ is the effective lip radius and $L$ is the tract length (Flanagan, 1972, pp. 61-63).
(v) Finally, the effect of wall vibration on the formant frequencies is to increase
them. Such an effect becomes weaker for higher frequencies, and can be approximately
expressed by
$$\omega_m^2 \simeq (406\pi)^2 + \hat{\omega}_m^2, \tag{2.9}$$
where $\omega_m$ and $\hat{\omega}_m$ are the angular frequencies of the m-th formant of a
tract with flexible and rigid walls, respectively (Sondhi, 1974).¹ More elaborate models do
exist (Maeda, 1982; Sondhi and Schroeter, 1987), as seen in the next section, but the main
point here is that formant frequencies and the log-area function are basically related
by Eq. (2.7).
Such a relationship (well analyzed in Fant, 1980), relatively independent of other
factors, justifies using the formant frequencies to parametrize the acoustics of the
speech signal, and the log-area function to represent the geometry of the vocal-tract.
The point here is that the formant frequencies can be determined from the log-area
function, even when a lossy model is considered. This is in contrast with other acoustic
parameters present in the speech signal, such as formant bandwidths and spectral
tilt, which result from the combination of the tract shape with other factors, such as
glottal excitation, yielding walls, and radiation load.² However, as already stated, the
formant frequencies do not contain enough information to uniquely determine the area
function. In Chapter 3, information about the vocal-tract structure (morphology) will
be used to reduce the ambiguity that arises from the lack of information in the formant
frequencies about the area function.

¹ In his work, Sondhi derived an equation in the same format as Webster's equation, but taking
into account yielding walls and viscous and thermal losses. It was shown that the formant
frequencies of a lossy model and those of a lossless model are approximately related by Eq. (2.9).
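The two simple corrections in items (iv) and (v) can be sketched numerically. This is a rough illustration, assuming CGS-style inputs (length in cm, lip area in cm$^2$) and frequencies in Hz; the function names `radiation_factor` and `wall_corrected_hz` are hypothetical, and the sample tract (17.5 cm long, 5 cm$^2$ lip area) is not taken from the corpus.

```python
import math


def radiation_factor(length_cm, lip_area_cm2):
    """Factor 3*pi*L / (3*pi*L + 8*a) by which radiation load lowers formants."""
    a = math.sqrt(lip_area_cm2 / math.pi)  # effective lip radius
    return 3.0 * math.pi * length_cm / (3.0 * math.pi * length_cm + 8.0 * a)


def wall_corrected_hz(rigid_wall_hz):
    """Formant of a yielding-wall tract from its rigid-wall value, Eq. (2.9)."""
    w_rigid = 2.0 * math.pi * rigid_wall_hz
    w_flexible = math.sqrt((406.0 * math.pi) ** 2 + w_rigid ** 2)
    return w_flexible / (2.0 * math.pi)


# Illustrative values only.
print(radiation_factor(17.5, 5.0))
print(wall_corrected_hz(500.0), wall_corrected_hz(2500.0))
```

The wall correction adds a fixed term under the square root, so its relative effect shrinks as the rigid-wall frequency grows, in line with item (v).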

2.2.3 Numerical determination of formant frequencies


Although analytical solutions exist for particular cases (Salmon, 1946; Eisner, 1966),
there is no analytical solution of Webster's equation (Eq. 2.3) when the area A(x) along
the vocal-tract is an arbitrary function. For this reason, numerical procedures are used
to determine the formant frequencies (eigenfrequencies) associated with a given area
function.³
The method adopted here approximates the vocal-tract by a concatenation of
uniform lossy tubes (Kelly and Lochbaum, 1962; Sondhi and Schroeter, 1987; Schroeter
and Sondhi, 1991). For the particular case of a uniform tube,⁴ the (one-dimensional)
sound wave equation can be solved analytically. From the solution, it is possible to
express, in the frequency domain, pressure and volume velocity at one end of the tube
as the product of a matrix by pressure and volume velocity at the other end. For the
K uniform sections that approximate the vocal-tract, this relation can be written as
$$\begin{bmatrix} P_{k-1} \\ U_{k-1} \end{bmatrix} = \begin{bmatrix} \mathcal{A}_k & \mathcal{B}_k \\ \mathcal{C}_k & \mathcal{D}_k \end{bmatrix} \begin{bmatrix} P_k \\ U_k \end{bmatrix} = \mathbf{K}_k \begin{bmatrix} P_k \\ U_k \end{bmatrix}, \qquad k = 1, \ldots, K, \tag{2.10}$$
where $P_k$, $U_k$, $P_{k-1}$ and $U_{k-1}$ are pressure and volume velocity at the section
ends closer to the lips and closer to the glottis, respectively. Using the model for losses
and yielding walls described in Sondhi (1974) and in Sondhi and Schroeter (1987), the
entries of matrix $\mathbf{K}_k$ are given by
$$\mathcal{A}_k = \cosh\!\left( \frac{\sigma l}{c} \right), \qquad \mathcal{B}_k = \frac{\rho c}{A_k}\, \gamma \sinh\!\left( \frac{\sigma l}{c} \right), \tag{2.11}$$
$$\mathcal{C}_k = \frac{A_k}{\rho c\, \gamma} \sinh\!\left( \frac{\sigma l}{c} \right), \qquad \mathcal{D}_k = \cosh\!\left( \frac{\sigma l}{c} \right),$$
where
$$\gamma = \sqrt{\frac{\alpha + j\omega}{\beta + j\omega}}, \tag{2.12}$$
$$\sigma = \sqrt{(\alpha + j\omega)(\beta + j\omega)}, \tag{2.13}$$
$$\alpha = \sqrt{j\omega c_1}, \tag{2.14}$$
$$\beta = \frac{j\omega\, \omega_0^2}{(j\omega + a)\, j\omega + b} + \alpha, \tag{2.15}$$
$$\omega_0^2 = \frac{c^2 \rho}{A_k L_w}, \tag{2.16}$$
$$a = \frac{R_w}{L_w}, \tag{2.17}$$
$$b = \frac{1}{L_w C_w}, \tag{2.18}$$
where $L_w$, $C_w$ and $R_w$ are mass, compliance and resistance of the tract walls per unit
length; $a$ is the ratio of wall resistance to mass; $b$ is the squared angular frequency of
the mechanical resonance; $c_1$ is the correction for thermal conductivity and viscosity;
$\omega_0$ is the lowest angular resonance frequency of the tract when it is closed at both
ends; $c$ is the sound velocity inside the tract; $\rho$ is the air density; $\omega$ is the
angular frequency; $A_k$ is the cross-sectional area of the section; and $l = L/K$ is the
section length, equal to the vocal-tract length $L$ divided by the number of sections $K$.
The numerical values used are
$$c = 3.5 \times 10^{4}\ \mathrm{cm/s},$$
$$\rho = 1.14 \times 10^{-3}\ \mathrm{g/cm^3},$$
$$a = 130\pi\ \mathrm{rad/s},$$
$$b = (30\pi)^2\ \mathrm{(rad/s)^2},$$
$$c_1 = 4\ \mathrm{rad/s},$$
$$\omega_0^2 = (406\pi)^2\ \mathrm{(rad/s)^2}.$$

² The use of additional acoustic information, such as formant bandwidths, is, in principle, very
difficult to handle. This is because, to the author's knowledge, losses due to the vocal-tract and
to the glottal source cannot be well separated. Moreover, the correct estimation of the bandwidths
is a task considerably more difficult to carry out than the estimation of formant frequencies
(which, sometimes, are also difficult to estimate, as in the case of the high-pitched voice of
female and child speakers).
³ Variational and perturbation methods were also tested (Yehia and Itakura, 1993b). It was found
that the range of applicability of perturbation analysis does not cover the entire articulatory
vowel space. Variational analysis proved to be more robust, but at a computational cost higher
than that of the numerical procedure adopted here.
⁴ In Välimäki and Karjalainen (1994), the interesting alternative approach of conical sections is
analyzed.

Now, if sound pressure and volume velocity at the lips ($P_K$ and $U_K$) are known, it is
possible to determine pressure and volume velocity at the glottis ($P_0$ and $U_0$) as
$$\begin{bmatrix} P_0 \\ U_0 \end{bmatrix} = \left( \prod_{k=1}^{K} \mathbf{K}_k \right) \begin{bmatrix} P_K \\ U_K \end{bmatrix}. \tag{2.19}$$
The formant frequencies are obtained by finding the maxima of the vocal-tract transfer
function defined by $20 \log_{10} |U_K / U_0|$. In order to compute this transfer function, the
sound pressure at the lips is expressed as
$$P_K = Z_r U_K, \tag{2.20}$$
where $Z_r$ is the output impedance determined by the radiation load. The model used
to represent the load is an RL circuit in parallel (Flanagan, 1972, pp. 36-38) with
$$R_r = \frac{128\, \rho c}{(3\pi)^2 A_K} \tag{2.21}$$
and
$$L_r = \frac{8 \sqrt{A_K / \pi}\; \rho c}{3\pi\, c\, A_K}. \tag{2.22}$$
In the final step, the volume velocity at the lips is arbitrarily set to $U_K = 1$, and
Eq. (2.19) is used to obtain the volume velocity at the glottis $U_0$. The transfer function
is then given by $-20 \log_{10} |U_0|$. As already stated, the formant frequencies are
determined by the maxima of the transfer function, and can be found by numerical
search.
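The procedure of this section can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes the parallel RL load has impedance $Z_r = j\omega L_r R_r / (R_r + j\omega L_r)$, computes $\sigma$ as $\gamma(\beta + j\omega)$ (algebraically the same as Eq. (2.13), but with a consistent square-root branch), and locates formants with a plain grid search in 5 Hz steps. The function names and the uniform test tract (17.5 cm, 5 cm$^2$) are hypothetical.

```python
import numpy as np

# Numerical values quoted in the text (CGS units).
C_SOUND = 3.5e4               # sound velocity, cm/s
RHO = 1.14e-3                 # air density, g/cm^3
WALL_A = 130.0 * np.pi        # wall resistance-to-mass ratio, rad/s
WALL_B = (30.0 * np.pi) ** 2  # squared wall resonance frequency, (rad/s)^2
C1 = 4.0                      # viscosity / heat-conduction correction, rad/s
W0_SQ = (406.0 * np.pi) ** 2  # squared closed-tract resonance, (rad/s)^2


def transfer_function_db(areas_cm2, length_cm, freqs_hz):
    """Return 20*log10|U_K/U_0| for a chain of equal-length lossy tubes."""
    areas = np.asarray(areas_cm2, dtype=float)
    section_len = length_cm / len(areas)
    lip_area = areas[-1]
    gains = np.empty(len(freqs_hz))
    for i, f in enumerate(freqs_hz):
        jw = 2j * np.pi * f
        alpha = np.sqrt(jw * C1)                                   # Eq. (2.14)
        beta = jw * W0_SQ / ((jw + WALL_A) * jw + WALL_B) + alpha  # Eq. (2.15)
        gamma = np.sqrt((alpha + jw) / (beta + jw))                # Eq. (2.12)
        sigma = gamma * (beta + jw)                                # Eq. (2.13)
        ch = np.cosh(sigma * section_len / C_SOUND)
        sh = np.sinh(sigma * section_len / C_SOUND)
        chain = np.eye(2, dtype=complex)
        for area in areas:                       # sections ordered glottis -> lips
            z = RHO * C_SOUND / area
            chain = chain @ np.array([[ch, z * gamma * sh],
                                      [sh / (z * gamma), ch]])     # Eq. (2.11)
        # Parallel RL radiation load, Eqs. (2.21) and (2.22).
        r_rad = 128.0 * RHO * C_SOUND / ((3.0 * np.pi) ** 2 * lip_area)
        l_rad = 8.0 * np.sqrt(lip_area / np.pi) * RHO / (3.0 * np.pi * lip_area)
        z_rad = jw * l_rad * r_rad / (r_rad + jw * l_rad)
        # With U_K = 1 and P_K = Z_r, the second row of Eq. (2.19) gives U_0.
        u_glottis = chain[1, 0] * z_rad + chain[1, 1]
        gains[i] = -20.0 * np.log10(abs(u_glottis))
    return gains


def formants_hz(areas_cm2, length_cm, f_max=4000.0, step=5.0):
    """Grid search for local maxima of the transfer function (100 Hz to f_max)."""
    freqs = np.arange(100.0, f_max, step)
    gains = transfer_function_db(areas_cm2, length_cm, freqs)
    is_peak = (gains[1:-1] > gains[:-2]) & (gains[1:-1] >= gains[2:])
    return freqs[1:-1][is_peak]


# Hypothetical uniform tract: 32 sections of 5 cm^2, total length 17.5 cm.
uniform_formants = formants_hz(np.full(32, 5.0), 17.5)
print(uniform_formants)
```

For the uniform tube, the peaks should fall near the lossless values (2m-1)c/(4L), shifted by the radiation and wall effects discussed in Section 2.2.2. A finer grid or a local interpolation around each peak would improve the frequency resolution.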

2.2.4 Comparison of lossless and lossy models


Before proceeding, it is interesting to compare the formant frequencies obtained from
lossy and lossless models of area functions of the same length and shape. The objective is
to find to what degree losses influence the formant frequencies of the area functions that
compose the cineradiographic corpus analyzed here. Figure 2.6 shows detailed information for
two frames extracted from the corpus. In the power spectrum panels, the transfer
functions obtained from lossless (dashed lines) and lossy models (solid black lines)
of the area function shown below are plotted together with the spectral envelope of
the speech frame shown above (gray line).⁵ The vocal-tract profile and the midsagittal
distances along the tract are also plotted for reference purposes. As expected,
losses have a stronger influence on formant bandwidths than on formant frequencies.
Note also the imperfect matching between speech spectral envelope peaks and vocal-tract
transfer function peaks already discussed in Section 2.1.3. The small deviations
observed in the left column of Fig. 2.6 may be explained by inaccuracies in the physical
model adopted to describe yielding walls and radiation load. However, in the spectra
shown in the right column of Fig. 2.6, the larger deviation observed for the second
formant is more likely due to misestimation of the area function.

⁵ LPC analysis: frame length = 20 ms; LPC order = 12; Hanning window; preemphasis coefficient
equal to r(1)/r(0), where r(0) is the energy and r(1) the first correlation coefficient of the
signal (Markel and Gray, 1976, p. 216).
In order to get qualitative and quantitative information about the effects of losses
over the whole corpus, Fig. 2.7 shows histograms of the relative deviation of the first
four formant frequencies obtained using a lossless vocal-tract model with respect to
formants obtained using a lossy model
$$\frac{F_{\mathrm{lossless}} - F_{\mathrm{lossy}}}{F_{\mathrm{lossy}}}.$$
The figure illustrates that the absence of losses has the effect of increasing the
frequencies of the second, third and fourth formants. This is due to the nonexistence
of radiation load (see Section 2.2.2). The absence of yielding wall effects has little
effect on the second, third and fourth formants, but tends to substantially lower the
first formant, which is also affected by the lack of radiation load.
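The relative deviation and its histogram in percentage of corpus points can be sketched as below. The function names and the four sample frequency pairs are hypothetical and serve only to show the computation; they are not results from the corpus.

```python
import numpy as np


def relative_deviation(f_lossless, f_lossy):
    """(F_lossless - F_lossy) / F_lossy, element by element."""
    f_lossless = np.asarray(f_lossless, dtype=float)
    f_lossy = np.asarray(f_lossy, dtype=float)
    return (f_lossless - f_lossy) / f_lossy


def percentage_histogram(deviation, bin_edges):
    """Percentage of points whose deviation falls within each bin."""
    counts, _ = np.histogram(deviation, bins=bin_edges)
    return 100.0 * counts / len(deviation)


# Made-up formant values (Hz) for four frames.
lossy = np.array([500.0, 400.0, 600.0, 300.0])
lossless = np.array([450.0, 420.0, 630.0, 345.0])
deviation = relative_deviation(lossless, lossy)
print(deviation)
print(percentage_histogram(deviation, [-0.2, 0.0, 0.2]))
```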

Comment
This chapter showed how to estimate the area function given a vocal-tract
profile, and how to compute formant frequencies given an area function. The problem
of estimating the area function given a set of formant frequencies is simplified if the
area function is expressed by appropriate parameters. This is the target of the next
chapter.
[Figure 2.6, panels for frames PB1518 (left) and PB1538 (right): Speech Signal versus Time (ms);
Power Spectrum (dB) versus Frequency (kHz); Area Function, Area (cm2) versus Distance from
Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) versus Distance from Glottis (cm);
Vocal-Tract Profile (cm).]

Figure 2.6: In each column, the second panel from the top shows a comparison of the
transfer functions estimated from lossy (black solid line) and lossless (dashed line) models
of the vocal-tract area function. The gray line represents the speech power spectrum
envelope. The speech signal, area function, midsagittal distances and vocal-tract profile
are also given as references. Note that there is little discrepancy between lossy and lossless
formant frequencies.
[Figure 2.7, panels Formant #1 to Formant #4: Percentage (%) versus Relative Deviation.]

Figure 2.7: Formant frequency variation due to yielding walls and radiation load. Each
chart shows a histogram of the relative deviation of formants derived from a lossless
vocal-tract model with respect to formants derived from a lossy model:
$(F_{\mathrm{lossless}} - F_{\mathrm{lossy}}) / F_{\mathrm{lossy}}$. The ordinates show the
percentage of points in the corpus whose relative deviation falls within the abscissa
interval under a given bin.
Chapter 3
Parametric Models
for the Area Function
"You people speak in terms of circles
and ellipses and regular velocities -- simple
movements that the mind can grasp -- very
convenient -- but suppose almighty God had
taken it into His head to make the stars move
like that... (He describes an irregular motion
with his finger through the air) ...then where
would you be?"
Bertolt Brecht (1898-1956)
Galileo
The number of sections necessary to obtain a good approximation of the vocal-tract
log-area function by a concatenation of uniform tubes of equal length is considerably
larger than the dimension of the space composed of the log-area functions
that can be produced by the human vocal-tract. This space will, from now on, be
called the articulatory space. Two procedures are analyzed in this chapter to represent
it by appropriate components. In the first section, representation by a Fourier
cosine series is examined, whereas a parametric statistical representation is presented
in the second section.
Another point that can be exploited in parametrizing the articulatory space is
the fact that the temporal behavior of the area function is subject to continuity
constraints. In the last section of this chapter, time sequences of parametrized log-area
functions are represented by series expansions.

3.1 Fourier Analysis


The main reason to represent the log-area function by a Fourier cosine series (Davis,
1963, pp. 107-112) is the property pointed out by Mermelstein (1967) that the first
M formant frequencies depend mainly on the first 2M terms of the Fourier cosine
series expansion of the log-area function. Specifically, the m-th formant frequency
depends mainly on the (2m - 1)-th term. Also, except for some critical cases, it has
a one-to-one relationship with this term, when all other terms are kept constant.
In fact, except for the above-mentioned critical cases, there is a one-to-one
relationship between the first M formants and the first M odd terms of the Fourier
cosine series expansion of the log-area function (Yehia and Itakura, 1993a; Yehia and
Itakura, 1996).
For the above reasons, whose importance will become clear along the text, the
first $N \ge 2M$ log-area function Fourier cosine coefficients¹ will be adopted here to
parametrize the geometry of the vocal-tract. Mathematically, the transformations are
defined² here as
$$\ln A(x) \simeq \sqrt{\frac{2}{L}} \left( \frac{a_0}{\sqrt{2}} + \sum_{n=1}^{N-1} a_n \cos\frac{n\pi x}{L} \right), \tag{3.1}$$
$$a_n = \sqrt{\frac{2}{L}} \int_0^L \ln A(x) \cos\frac{n\pi x}{L}\, dx, \tag{3.2}$$
$$a_0 = \frac{1}{\sqrt{L}} \int_0^L \ln A(x)\, dx, \tag{3.3}$$
or, in a discrete form,
$$\ln A_k \simeq \sqrt{\frac{2}{K}} \left( \frac{a_0}{\sqrt{2}} + \sum_{n=1}^{N-1} a_n c_{nk} \right), \qquad c_{nk} = \cos\!\left[ \frac{n\pi}{K} (k - \tfrac{1}{2}) \right], \qquad k = 1, \ldots, K, \tag{3.4}$$
$$a_n = \sqrt{\frac{2}{K}} \sum_{k=1}^{K} \ln A_k\, c_{nk}, \qquad A_k = A\!\left[ (k - \tfrac{1}{2}) \frac{L}{K} \right], \qquad n = 1, \ldots, N-1, \tag{3.5}$$
$$a_0 = \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \ln A_k, \tag{3.6}$$
and, in a practical vector notation, area function and tract length can be put together
as (see Eq. 2.2)
$$\mathbf{y} \simeq \mathbf{U}_{ay}\, \mathbf{a}, \tag{3.7}$$
$$\mathbf{a} = \mathbf{U}_{ay}^t\, \mathbf{y}, \tag{3.8}$$
where
$$\mathbf{y} = [\ln A_1, \ldots, \ln A_K, L]^t, \tag{3.9}$$
$$\mathbf{a} = [a_1, a_3, \ldots, a_{2M-1}, a_0, a_2, \ldots, a_{2M}, \ldots, a_{N-1}, L]^t, \tag{3.10}$$
(the reason for this unusual component order will become clear in Section 4.1.2) and
$$\mathbf{U}_{ay} =
\begin{bmatrix}
\sqrt{\dfrac{2}{K}}
\begin{bmatrix}
c_{1,1} & c_{3,1} & \cdots & c_{2M-1,1} & \frac{1}{\sqrt{2}} & c_{2,1} & \cdots & c_{2M,1} & \cdots & c_{N-1,1} \\
c_{1,2} & c_{3,2} & \cdots & c_{2M-1,2} & \frac{1}{\sqrt{2}} & c_{2,2} & \cdots & c_{2M,2} & \cdots & c_{N-1,2} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & & \vdots \\
c_{1,K} & c_{3,K} & \cdots & c_{2M-1,K} & \frac{1}{\sqrt{2}} & c_{2,K} & \cdots & c_{2M,K} & \cdots & c_{N-1,K}
\end{bmatrix}
& \mathbf{0} \\
\mathbf{0}^t & 1
\end{bmatrix}. \tag{3.11}$$
In this discrete form, K is the number of uniform sections used to approximate the
area function. The area of each section is obtained by sampling the continuous area
function at the points $x_k = (k - \tfrac{1}{2}) \frac{L}{K}$, where L is the tract length, and
L/K is the section length.

¹ Here, Fourier cosine coefficient means coefficient of the Fourier cosine series, and not cosine
coefficient of the Fourier series.
² This is a convenient definition because of its properties of symmetry.
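The discrete analysis and synthesis of Eqs. (3.4) to (3.6) can be sketched as follows. This is an illustration only: coefficients are kept in natural order (a_0, a_1, ..., a_{N-1}) rather than the reordered vector of Eq. (3.10), the tract length is left out, and the function names and test signal are hypothetical.

```python
import numpy as np


def cosine_basis(n_sections, n_coeffs):
    """K x N matrix whose n-th column is the n-th orthonormal cosine vector."""
    k = np.arange(1, n_sections + 1)[:, None]  # section index k = 1..K
    n = np.arange(n_coeffs)[None, :]           # coefficient order n = 0..N-1
    basis = np.sqrt(2.0 / n_sections) * np.cos(n * np.pi / n_sections * (k - 0.5))
    basis[:, 0] = 1.0 / np.sqrt(n_sections)    # 0-th order: sqrt(2/K) * 1/sqrt(2)
    return basis


def analyze(log_area, n_coeffs):
    """Fourier cosine coefficients a_0..a_{N-1} of a sampled log-area function."""
    log_area = np.asarray(log_area, dtype=float)
    return cosine_basis(len(log_area), n_coeffs).T @ log_area


def synthesize(coeffs, n_sections):
    """Log-area function approximated from its first N cosine coefficients."""
    coeffs = np.asarray(coeffs, dtype=float)
    return cosine_basis(n_sections, len(coeffs)) @ coeffs


# Hypothetical log-area: a constant plus a pure third-order cosine term.
K = 32
k = np.arange(1, K + 1)
test_log_area = 1.0 + np.cos(3 * np.pi / K * (k - 0.5))
coeffs = analyze(test_log_area, 9)
print(np.round(coeffs, 6))
```

Because the columns are orthonormal, the same matrix serves for analysis (transpose) and synthesis, which is the symmetry referred to in the definition.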

3.1.1 Truncation E ects


The effects of taking a finite number of Fourier coefficients are illustrated in Figures 3.1
and 3.2 for the French vowel /i/. Figure 3.1 shows the area function represented
by a concatenation of K = 32 tubes and, following it, the results obtained by its
approximation by N = 9, 8, ..., 3 and 2 Fourier cosine coefficients. It is possible
to see in Figure 3.2 that the m-th formant frequency (calculated with the model
described in Section 2.2.3) reaches a value close to its true value when the area
function is approximated by the first N = 2m terms (including the 0-th order) of the
corresponding Fourier cosine series.
It is also possible to see that, for an approximation of N = 9 terms, the maximal
deviation observed in the figure for the first M = 3 formants is less than 3%, and so,
less than the JND (just-noticeable difference) found by Flanagan (1955). Although
this limit is exceeded in some frames of the corpus, from now on, when using the Fourier
representation, the area function will be approximated by the first N = 9 terms of
its log-area Fourier cosine series expansion, plus the vocal-tract length.

[Figure 3.1, panels: Original /i/ and N = 9, 8, 7, 6, 5, 4, 3, 2; Area (cm2) versus position (cm).]

Figure 3.1: Area function for the French /i/, and the same area approximated by trun-
cated Fourier cosine series expansions (thick lines), compared with the original area (thin
lines).

3.1.2 Formant frequencies as functions of Fourier coefficients


Formant frequencies can now be interpreted as functions of log-area Fourier coefficients.
With the objective of getting a better comprehension of the behavior of these
functions, Figure 3.3 shows how the first three formant frequencies vary with respect
to each of the first six Fourier cosine coefficients (excluding the zero-th order), when
all other coefficients are set to zero. A histogram of the Fourier cosine coefficients
of the areas contained in the corpus is plotted above each graph. Note that the first
three formant frequencies vary almost linearly with the first three odd Fourier cosine
coefficients.
The joint influence of the first ($a_1$) and third ($a_3$) Fourier cosine coefficients on
the first and second formant frequencies, when all other coefficients are set to zero,
is shown in Figure 3.4. $a_1$ and $a_3$ have dominant influence (only) on the first and
second formants, respectively.

[Figure 3.2, "Formant Frequencies versus Fourier Coefficients": Formant Frequencies (kHz) versus
Number of Fourier Coefficients (N), for N = 1 to 9 and for the Real Area.]

Figure 3.2: Formant frequencies as a function of the number of Fourier cosine coefficients
used to represent the area function of the French /i/. (Note that N includes the
0-th order term.)

3.2 Statistical Analysis


The objective of this section is to find representations for both the log-area
(articulatory) space and the formant frequency (acoustic) space so that
• each space is efficiently represented by a small number of parameters;
• the components of each space are as independent as possible;
• the mapping between both spaces is as simple as possible.
These points will be analyzed one by one in the following sections.

3.2.1 Principal Component Analysis


Articulatory space
The relationship between Fourier cosine coefficients and formant frequencies is indeed
interesting; however, for the case of the human vocal-tract, it cannot be said that the
parametrization by a truncated Fourier series is optimal. This is because cosine
functions, which are the basis functions of a Fourier cosine series expansion, are
"general purpose functions" that, in principle, are not directly related to the vocal-tract
morphology.
[Figure 3.3, panels a1 to a6: Freq. (kHz) versus coefficient value.]

Figure 3.3: First three formants as functions of the first 6 Fourier cosine coefficients. In
each graph, all other coefficients are kept equal to zero. A histogram of the coefficients
of the log-area functions of the corpus analyzed is plotted above each chart.

[Figure 3.4, panels First Formant, F1 (kHz), and Second Formant, F2 (kHz), plotted over a1 and a3.]

Figure 3.4: First and second formant frequencies as functions of the first and third Fourier
cosine coefficients $a_1$ and $a_3$, when all other coefficients are equal to zero.

In this section, Principal Component Analysis (PCA) (Horn, 1985, pp. 411-455)
will be used to parametrize the articulatory space by an appropriate number of
components (Yehia, Takeda and Itakura, 1996). The procedure is as follows: given the
corpus of log-area vectors defined in Eq. (2.2), the corresponding covariance matrix³
is given by
$$\mathbf{C}_y = \frac{1}{P-1} \sum_{i=1}^{P} [\mathbf{y}_i - \bar{\mathbf{y}}][\mathbf{y}_i - \bar{\mathbf{y}}]^t, \tag{3.12}$$
where $\bar{\mathbf{y}}$ is the mean log-area vector, and can be expressed as
$$\mathbf{C}_y = \mathbf{U} \mathbf{S} \mathbf{U}^t, \tag{3.13}$$
where $\mathbf{S}$ is a diagonal matrix containing the eigenvalues of $\mathbf{C}_y$ in decreasing
order, and $\mathbf{U}$ is a unitary matrix whose columns contain the corresponding normalized
eigenvectors. The expansion above is a Takagi factorization, which is a singular value
decomposition for the particular case of symmetric matrices (Horn, 1985, pp. 201-218).
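Using the same optimality principle as the Karhunen-Loève transform (Jayant and
Noll, 1984, pp. 535-546), $\mathbf{y}$ can then be approximated by
$$\mathbf{y} \simeq \mathbf{U}_{\beta y}\, \boldsymbol{\beta} + \bar{\mathbf{y}}, \tag{3.14}$$
with the parameter vector $\boldsymbol{\beta}$ given by
$$\boldsymbol{\beta} = \mathbf{U}_{\beta y}^t\, (\mathbf{y} - \bar{\mathbf{y}}), \tag{3.15}$$
where $\mathbf{U}_{\beta y}$ is the matrix containing the first N columns of $\mathbf{U}$, i.e. the
normalized eigenvectors corresponding to the N largest eigenvalues of $\mathbf{C}_y$. The K + 1 = 33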
eigenvalues are shown in Fig. 3.5. Note that only the first N = 5 eigenvalues have
non-negligible values, and that they "explain" more than 92% of the variance of the
corpus of log-areas. The eigenvectors associated with the largest N = 5 eigenvalues
are shown in Fig. 3.6. They will be used in this study to form a parametric model
for the vocal-tract log-area function. Since the components of this model cannot be
explicitly interpreted as articulators, it cannot be qualified as an articulatory model
(Mermelstein, 1973; Coker, 1976; Maeda, 1990). In spite of that, it is possible to
observe in Fig. 3.6 that: the first and most important eigenvector is associated with
the tongue region; the tract length is the dominant component of the second and fifth
eigenvectors; the lips determine the dominant component of the third eigenvector; and
the tongue apex is the dominant region of the fourth eigenvector. Also, note that there
is almost no influence of the glottal region on the first three eigenvectors. In order to
illustrate the performance of this representation, Fig. 3.7 shows an area function taken
from the corpus (thick line), and its approximations by a truncated Fourier cosine
series (dashed line) and by the parametric model described here (thin solid line). Note
that, in contrast with the Fourier series representation, the parametric model is able
to "capture" the vocal-tract structure. In Fig. 3.8, the original trajectories followed by
the tract length, by the area at the lips, and by the area of a section in the alveopalatal
region are shown by the dashed lines. The corresponding trajectories obtained with
the parametric model proposed here are shown by the solid lines. Since the parametric
model is derived from the log-area function, the approximation is particularly good
for small areas, which are critical from the acoustic point of view.

³ All P vectors contained in the corpus were used to compute $\mathbf{C}_y$. The possibility of
not including a few vectors in the estimation of $\mathbf{C}_y$ and using them only in the tests
carried out in Chapter 5 was considered. However, since by doing so the entries of the truncated
eigenvector matrix varied by less than 5%, we opted for using the whole corpus to derive the model
as well as to test it. Although this procedure is, admittedly, not the most rigorous, it gives us
an extensive base to analyze the model.

[Figure 3.5, panels Log-Area Eigenvalues (Eigenvalue, with 5% and 1% threshold lines) and Log-Area
Eigenvalue Sum (Eigenvalue Sum, with 90%, 95% and 99% threshold lines), both versus Eigenvalue
Number.]

Figure 3.5: Eigenvalues of the log-area covariance matrix.
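The principal component procedure described above can be sketched as follows. This is a minimal illustration on synthetic data, not the corpus: it uses `numpy.linalg.eigh` for the symmetric eigendecomposition, and the function names `pca_model`, `encode` and `decode` are hypothetical.

```python
import numpy as np


def pca_model(Y, n_components=5):
    """Mean, leading eigenvectors and all eigenvalues of the covariance of Y (P x D)."""
    Y = np.asarray(Y, dtype=float)
    mean = Y.mean(axis=0)
    cov = np.cov(Y, rowvar=False)         # 1/(P-1) normalization, as in Eq. (3.12)
    eigval, eigvec = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]      # reorder to decreasing eigenvalues
    eigval, eigvec = eigval[order], eigvec[:, order]
    return mean, eigvec[:, :n_components], eigval


def encode(y, mean, basis):
    """Parameter vector of Eq. (3.15): projection of the centered vector."""
    return basis.T @ (np.asarray(y, dtype=float) - mean)


def decode(params, mean, basis):
    """Approximation of Eq. (3.14) from a parameter vector."""
    return basis @ np.asarray(params, dtype=float) + mean


# Synthetic rank-2 data standing in for the corpus (not real measurements).
rng = np.random.default_rng(1)
Y_demo = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 6)) + 3.0
mean, basis, eigval = pca_model(Y_demo, n_components=2)
explained = eigval[:2].sum() / eigval.sum()
print(round(explained, 6))
```

The ratio `explained` plays the role of the cumulative eigenvalue sum plotted in Fig. 3.5. For this synthetic rank-2 example it is essentially one, whereas the text reports more than 92% for N = 5 on the real corpus.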

Dimensionality and degrees of freedom


Summarizing, it was shown that vocal-tract log-area vectors can be efficiently
represented in an N = 5 dimensional articulatory space. Here, it is interesting to note that
most articulatory models are expressed by seven to nine components. This happens
because their formulation is oriented to the direct problem of speech production. In
that case, it is important to consider the number of degrees of freedom of the vocal
apparatus, which is usually larger than the dimension of the articulatory space.

Acoustic Space
To each log-area vector there corresponds one, and only one, set of formant frequencies.
Here, the set composed of the first three formant log-frequencies⁴
will be called a formant vector, and the space formed by all formant vectors that can
be generated by the vocal-tract will be called the acoustic space.
By performing an eigenvalue decomposition on the covariance matrix of the formant
vectors (in log-scale) derived from the P = 519 log-area vectors of the corpus,
the following eigenvalues were found
$$[\, 159 \quad 94 \quad 9 \,] \times 10^{-4},$$
⁴ A higher degree of linearity was observed in the mapping between articulatory and acoustic
spaces when formant frequencies were represented in log-scale.
[Figure 3.6, panels Eigenvector #1 to Eigenvector #5, components ordered Glottis -> Lips -> Length;
Sqrt(Eigenvalue) = 2.66, 1.94, 1.48, 0.89 and 0.84 for eigenvectors #1 to #5, respectively.]

Figure 3.6: Eigenvectors corresponding to the rst 5 eigenvalues obtained from the de-
composition of the log-area covariance matrix. All eigenvectors are normalized to have
unit Euclidean norm. The rst K = 32 components correspond to the log-area along
the tract; and the last component corresponds to the tract length. The corresponding
eigenvalue square root is given as a reference to the \importance" of each eigenvector.
36 CHAPTER 3. PARAMETRIC MODELS FOR THE AREA FUNCTION

Area Function: Vowel /i/

3 terms
Area (cm2)

0
0 5 10 15

5 terms
Area (cm2)

0
0 5 10 15
Distance from Glottis (cm)

Figure 3.7: Area function approximations by Fourier cosine series expansion (dashed
line), and by statistically optimum eigenvalue expansion (solid line). The thick solid line
shows the original area. Above: expansion with 3 components. Below: expansion with 5
components.
3.2. STATISTICAL ANALYSIS 37

Vocal−Tract Length Trajectory


Length (cm)

18
Mean Difference: 0.07%
16 Standard Deviation: 0.23%

14
0 0.2 0.4 0.6 0.8 1
Lip Area Trajectory
Area (cm2)

Mean Difference: 3%
5 Standard Deviation: 14%

0
0 0.2 0.4 0.6 0.8 1
Alveopalatal Area Trajectory
Area (cm2)

5 Mean Difference: −7%


Standard Deviation: 7%

0
0 0.2 0.4 0.6 0.8 1
Time (s)

Figure 3.8: Vocal-tract length, lip area, and alveopalatal area trajectories along the
sentence (in French): Ma chemise est roussie. The dashed lines show the original
measured trajectories, while the solid lines show the trajectories parametrized by the
model proposed here. For each case, the mean and the standard deviation values of the
relative di erence (in percentage) between parametrized and original trajectories are also
shown.
38 CHAPTER 3. PARAMETRIC MODELS FOR THE AREA FUNCTION

the normalized eigenvectors being given by the columns of the matrix
$$\begin{bmatrix} 0.933 & 0.360 & 0.006 \\ 0.333 & 0.870 & 0.364 \\ 0.137 & 0.338 & 0.931 \end{bmatrix}. \tag{3.16}$$
This means that more than 96% of the total variance can be explained by the first two eigenvalues. For this reason, the possibility of representing the acoustic space in two dimensions was considered. However, since the acoustic information associated with the third eigenvalue can be important for the inverse problem, it was decided to use the first three formant log-frequencies to parametrize a three-dimensional acoustic space.
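As a quick check of the percentage quoted above, the fraction of variance carried by the leading eigenvalues can be computed directly from the three values reported in the text. This is only an arithmetic sketch; the variable names are illustrative and are not taken from the thesis.

```python
# Eigenvalues of the formant log-frequency covariance matrix,
# as reported in the text (in units of 1e-4).
eigenvalues = [159e-4, 94e-4, 9e-4]

total_variance = sum(eigenvalues)

# Cumulative fraction of the total variance explained by the first k eigenvalues.
explained = [sum(eigenvalues[:k]) / total_variance for k in range(1, 4)]

for k, frac in enumerate(explained, start=1):
    print(f"first {k} eigenvalue(s): {100 * frac:.1f}% of the total variance")
```

The second entry of `explained` is 253/262, which is consistent with the "more than 96%" stated in the text.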

Principal components and formant frequencies


Unlike Fourier cosine coefficients, there is not a one-to-one mapping between a subspace determined by a subset of the principal components and the space determined by the formant log-frequencies. Such a mapping is a necessary condition for the solution of the inverse problem described in the next chapter. This obstacle can be overcome by performing appropriate transformations on both principal components of the log-area function and formant log-frequencies.

3.2.2 Independent component analysis


The objective of this section is to perform linear transformations on the coordinate systems of both articulatory and acoustic spaces, so that the components of each space become as independent as possible. The final objective is to find a mapping of the articulatory space onto the acoustic space, where each component of the acoustic space is mainly determined by one, and only one, component of the articulatory space. Also, each component of the articulatory space must have major influence on at most one component of the acoustic space. In order to attain this objective, a necessary condition is that the components of each space be as independent as possible.

Articulatory Space
The first step is to find how the articulatory space, defined in the last section, maps onto the acoustic space. To reach this target, the hyperrectangle defined by the maximum and minimum values of each of the N = 5 components of the parametrized corpus is first "filled" with $Q_0 = 30{,}000$ uniformly distributed points.⁵ Fig. 3.9a illustrates this operation by showing the projection on the subspace defined by $\alpha_1$ and $\alpha_2$. However, not all the points in the hyperrectangle correspond to realistic vocal-tract areas. For this reason, all points in the hyperrectangle that correspond to areas outside the limits defined by the P = 519 areas present in the corpus described in Section 2.1 are discarded.⁶ The remaining Q = 7,285 points are shown in Fig. 3.9b. After that, the independent component analysis method proposed by Bell and Sejnowski (1995) is applied to these points to find a linear transformation ($T_{\alpha\beta} : \mathbb{R}^5 \to \mathbb{R}^5$) that changes the coordinate system of the articulatory space into a system with statistically "less dependent" components. (The term "less dependent" is used because, in the present case, a simple linear transformation is not enough to obtain a complete decomposition into independent components.) Mathematically, this transformation is written as
$$\beta = T_{\alpha\beta}(\alpha - \mu_\alpha), \tag{3.17}$$
where $\mu_\alpha$ is the mean of the Q = 7,285 vectors $\alpha$ generated to "fill" the articulatory space. Fig. 3.9c shows the same points shown in Fig. 3.9b, now plotted in the new coordinate system.
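The fill-and-discard step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Appendix A: the basis, mean, and limits are small made-up values (two parameters, three tube sections), and the function names are hypothetical.

```python
import random

def fill_hyperrectangle(lower, upper, n_points, rng):
    """Draw n_points uniformly inside the box [lower, upper] (one bound per axis)."""
    return [[rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
            for _ in range(n_points)]

def to_log_area(alpha, basis, mean):
    """Linear reconstruction y = U alpha + mean; basis[k] is row k of U."""
    return [sum(u * a for u, a in zip(row, alpha)) + m
            for row, m in zip(basis, mean)]

def is_realistic(y, y_min, y_max):
    """Keep a point only if every component of y lies within the corpus limits."""
    return all(lo <= v <= hi for v, lo, hi in zip(y, y_min, y_max))

# Toy stand-ins (hypothetical values, NOT the thesis data).
rng = random.Random(0)
basis = [[0.8, 0.1], [0.5, -0.6], [0.2, 0.7]]
mean = [0.0, 0.5, 1.0]
y_min, y_max = [-1.0, -0.5, 0.0], [1.0, 1.5, 2.0]

candidates = fill_hyperrectangle([-2.0, -2.0], [2.0, 2.0], 3000, rng)
kept = [a for a in candidates
        if is_realistic(to_log_area(a, basis, mean), y_min, y_max)]
print(len(candidates), len(kept))
```

Only the surviving points in `kept` would then be passed on to the independent component analysis.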

Acoustic Space
For a given point $\beta$ in the articulatory space, it is possible to find the corresponding log-area vector y using the following inverse transformation
$$y = U_{\alpha y}(T_{\alpha\beta}^{-1}\beta + \mu_\alpha) + \mu_y. \tag{3.18}$$
Then, using the wave propagation model described in Section 2.2.3, it is possible to calculate the formant vector f formed by the first three formant log-frequencies associated with y and, consequently, with $\beta$:
$$f = f(\beta). \tag{3.19}$$
⁵ 30,000 was chosen arbitrarily as a number sufficiently large to characterize a uniform distribution in a five-dimensional space.
⁶ All points associated with log-area vectors containing values outside the limits defined by the maximal and minimal values of each component of the corpus are discarded. (More details are given in Appendix A.)
Figure 3.9: (a) Parametric subspace determined by the first two components of $\alpha$. (b) Points corresponding to realistic area functions (articulatory space). (c) The same points shown in a coordinate system with "less dependent" components ($\beta_1$ and $\beta_2$).

This procedure was carried out for all Q = 7,285 points shown in Fig. 3.9c. The corresponding formant log-frequency normalized histograms, which are approximations for the probability density functions, are shown in Fig. 3.10a, while the scattering on the plane defined by $f_1$ and $f_2$ is shown in Fig. 3.10c. After that, the independent component analysis (ICA) method described in Bell and Sejnowski (1995) was used to find a linear transformation ($T_{fg} : \mathbb{R}^3 \to \mathbb{R}^3$) that changes the coordinate system defined by the formant log-frequencies into a system with "less dependent" variables. This transformation can be written as
$$g = T_{fg}(f - \mu_f), \tag{3.20}$$
where $\mu_f$ is the mean of the logarithm of the Q = 7,285 formant vectors available. The normalized histograms obtained for the components of g are shown in Fig. 3.10b, and the scattering of the first two components of g is shown in Fig. 3.10d.
At this point, g and $\beta$ define respectively acoustic and articulatory vector variables whose components are more independent than the components of f and $\alpha$. The next step is to model the relationship between acoustic and articulatory spaces.
Before continuing, a few words about the independent component analysis (ICA) technique used here are in order. The ICA problem consists of finding a linear transformation which, when applied to a given ensemble of random vectors, transforms it into an ensemble of vectors whose components are statistically independent, in an ideal case, or as independent as possible, in practical cases. The theoretical background of the problem is very well described in Comon (1994). The approach described in Bell and Sejnowski (1995) (and used in this paper) is based on entropy maximization which, under appropriate conditions, implies mutual information minimization, and consequent independence maximization. The method was originally used to solve the problem of blind separation of mixed sound sources, but has a potentially larger range of applications.

3.2.3 Singular Value Decomposition


In this section, the mapping from $\beta$ onto g is approximated by a linear transformation ($T_{\beta g} : \mathbb{R}^5 \to \mathbb{R}^3$) as follows
$$g \simeq T_{\beta g}\,\beta. \tag{3.21}$$
Figure 3.10: (a) Normalized histograms of the first 3 formant log-frequencies corresponding to an articulatory space filled with approximately uniformly distributed points. (b) Histograms of the variables obtained after independent component analysis (ICA) of the formant log-frequencies. (c) and (d) Scatter plots of the first 2 variables shown in (a) and (b), respectively.

In such a case, once an ensemble of vectors g and $\beta$ is available, a minimum mean square error (MMSE) procedure can be used to estimate $T_{\beta g}$, yielding
$$T_{\beta g} = GB^t(BB^t)^{-1}, \tag{3.22}$$
with
$$G = [g_1 \ldots g_Q], \tag{3.23}$$
and
$$B = [\beta_1 \ldots \beta_Q]. \tag{3.24}$$
In the above equations, Q = 7,285 is the number of points present in the ensembles.
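For completeness, the MMSE estimate of Eq. (3.22) can be recovered with a standard least-squares argument. The sketch below assumes that $BB^t$ is invertible and takes the sum of squared errors over the ensemble as the criterion; the symbol $J$ is introduced here only for this derivation.

```latex
\begin{align*}
J(T_{\beta g}) &= \sum_{i=1}^{Q} \lVert g_i - T_{\beta g}\,\beta_i \rVert^2
                = \lVert G - T_{\beta g} B \rVert_F^2, \\
\frac{\partial J}{\partial T_{\beta g}} &= -2\,(G - T_{\beta g} B)\,B^t = 0
  \;\Longrightarrow\; T_{\beta g}\, B B^t = G B^t
  \;\Longrightarrow\; T_{\beta g} = G B^t (B B^t)^{-1}.
\end{align*}
```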
Once $T_{\beta g}$ is determined, a singular value decomposition procedure (Horn, 1985, pp. 411–455) can be used to find rotations of the acoustic (g) and articulatory ($\beta$) coordinate systems, so that each of the first three components of the articulatory space has major influence on one, and only one, component of the acoustic space. The singular value decomposition of $T_{\beta g}$ yields
$$T_{\beta g} = U_{hg}\, S_{h\gamma}\, U_{\gamma\beta}^t, \tag{3.25}$$
where $U_{hg}$ is a unitary matrix containing the normalized eigenvectors of $T_{\beta g} T_{\beta g}^t$, $U_{\gamma\beta}$ is a unitary matrix containing the normalized eigenvectors of $T_{\beta g}^t T_{\beta g}$, and $S_{h\gamma}$ is a 3 × 5 matrix whose first 3 columns define a diagonal matrix containing the square roots of the eigenvalues of $T_{\beta g} T_{\beta g}^t$, and whose last two columns are all equal to zero. Now, since the multiplication of a unitary matrix by a vector represents a rotation of this vector,
$$\gamma = U_{\gamma\beta}^t\, \beta \tag{3.26}$$
and
$$h = U_{hg}^t\, g \tag{3.27}$$
define, respectively, "rotated" articulatory and acoustic variables. Vectors $\gamma$ and h define parametric representations for log-area vectors y and formant log-frequency vectors f. The relation between y and $\gamma$ is obtained straightforwardly from equations (3.14), (3.15), (3.17), and (3.26), while the relation between f and h is obtained from equations (3.20) and (3.27), yielding
$$\gamma = T_{y\gamma}(y - \mu'_y), \tag{3.28}$$
$$y \simeq T_{\gamma y}\,\gamma + \mu'_y, \tag{3.29}$$
where
$$T_{y\gamma} = U_{\gamma\beta}^t\, T_{\alpha\beta}\, U_{\alpha y}^t, \tag{3.30}$$
$$T_{\gamma y} = U_{\alpha y}\, T_{\alpha\beta}^{-1}\, U_{\gamma\beta}, \tag{3.31}$$
$$\mu'_y = \mu_y + U_{\alpha y}\, \mu_\alpha, \tag{3.32}$$
and
$$h = T_{fh}(f - \mu_f), \tag{3.33}$$
$$f = T_{fh}^{-1} h + \mu_f, \tag{3.34}$$
where
$$T_{fh} = U_{hg}^t\, T_{fg}. \tag{3.35}$$
The basis vectors contained in the rows of $T_{y\gamma}$ as well as $\mu'_y$ are plotted in Fig. 3.11. The numerical values are given in Appendix A.
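The effect of the two rotations can be made explicit by substituting Eqs. (3.25) to (3.27) into the linear approximation of Eq. (3.21). The short derivation below is a sketch that relies only on the orthogonality of $U_{hg}$ and $U_{\gamma\beta}$; the symbols $s_1, s_2, s_3$ for the singular values on the diagonal of $S_{h\gamma}$ are introduced here for illustration.

```latex
\begin{align*}
h = U_{hg}^t\, g
  \simeq U_{hg}^t\, T_{\beta g}\, \beta
  = U_{hg}^t\, U_{hg}\, S_{h\gamma}\, U_{\gamma\beta}^t\, \beta
  = S_{h\gamma}\, \gamma,
\qquad\text{i.e.}\qquad
h_m \simeq s_m\, \gamma_m, \quad m = 1, 2, 3.
\end{align*}
```

Within the linear approximation, each acoustic component therefore depends on a single articulatory component, and $\gamma_4$ and $\gamma_5$ do not appear at all.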
The matrix of correlation coefficients (Papoulis, 1991, p. 152) between $\gamma$ and h can be estimated by
$$R_{h\gamma} = \frac{H\,\Gamma^t}{(Q-1)\,\sigma_h \sigma_\gamma^t}, \tag{3.36}$$
where
$$H = [h_1 \ldots h_Q], \tag{3.37}$$
$$\Gamma = [\gamma_1 \ldots \gamma_Q], \tag{3.38}$$
$\sigma_h$ and $\sigma_\gamma$ are the column vectors containing respectively the standard deviations of h and $\gamma$, Q = 7,285 is the number of points present in the ensembles, and the division of $H\Gamma^t$ by $\sigma_h \sigma_\gamma^t$ is performed element-wise. The numerical result obtained is shown below
$$R_{h\gamma} = \begin{bmatrix} 0.939 & 0.003 & 0.005 & 0.004 & 0.004 \\ 0.003 & 0.953 & 0.003 & 0.004 & 0.002 \\ 0.002 & 0.001 & 0.461 & 0.001 & 0.002 \end{bmatrix}.$$
This matrix shows that there exists a high degree of correlation between the first two acoustic components and the first two articulatory components. There is also a non-negligible degree of correlation between the third acoustic and articulatory components. All other correlation coefficients are very small.
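The estimate of Eq. (3.36) can be sketched in a few lines. The example below is a minimal illustration that assumes zero-mean variables stored one per row and one sample per column; the function name and the toy data are hypothetical and are not taken from the corpus.

```python
import math

def correlation_matrix(H, G):
    """Correlation coefficients between the rows of H and the rows of G.

    H and G hold one zero-mean variable per row and one sample per column.
    The result is (H G^t) / (Q - 1), divided element-wise by sigma_h sigma_g^t.
    """
    Q = len(H[0])

    def std(row):
        # Standard deviation of a zero-mean row, with the (Q - 1) normalization.
        return math.sqrt(sum(v * v for v in row) / (Q - 1))

    sig_h = [std(r) for r in H]
    sig_g = [std(r) for r in G]
    return [[sum(a * b for a, b in zip(h_row, g_row)) / ((Q - 1) * sh * sg)
             for g_row, sg in zip(G, sig_g)]
            for h_row, sh in zip(H, sig_h)]

# Toy zero-mean data (hypothetical): h1 is proportional to gamma1,
# and gamma2 is orthogonal to gamma1.
gamma = [[-2.0, -1.0, 0.0, 1.0, 2.0],
         [1.0, -2.0, 2.0, -2.0, 1.0]]
h = [[-1.0, -0.5, 0.0, 0.5, 1.0]]

R = correlation_matrix(h, gamma)
print(R)
```

In this toy case the first coefficient is one and the second is zero, which mirrors the near-diagonal structure reported for $R_{h\gamma}$.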
At this point, in order to see the importance of the independent component analysis described in Section 3.2, it is interesting to compare $R_{h\gamma}$ with the matrix of correlation coefficients obtained when f and $\alpha$ are used in place of g and $\beta$ to obtain
Figure 3.11: Basis vectors (rows of $T_{y\gamma}$) and mean vector ($\mu_y$) used to represent log-area vectors (y). Units: the first K = 32 components are log-areas along the tract (glottis to lips) expressed in log(cm²). The last component is the tract length expressed in normalized units (1 unit = 0.53 cm for the basis vectors and 1 unit = 20 × 0.53 cm for the mean vector).

h and $\gamma$, as done in Yehia and Itakura (1995b) and in Yehia et al. (1995). The result is shown below
$$R_{f\alpha} = \begin{bmatrix} 0.944 & 0.270 & 0.206 & 0.308 & 0.035 \\ 0.270 & 0.944 & 0.266 & 0.258 & 0.183 \\ 0.112 & 0.142 & 0.511 & 0.069 & 0.184 \end{bmatrix}.$$
Note that, although the correlation between the acoustic components and the corresponding first three articulatory components continues to exist, the other correlation coefficients are no longer negligible.
It should be pointed out, however, that lack of correlation does not imply independence. This fact is illustrated in Fig. 3.12, where scatterings representing the joint cross-distributions of the components of h and of $\gamma$ are plotted. There exists, for example, an apparent nonlinear relation between $h_3$ and $\gamma_1$. This kind of dependence cannot be well approximated by the linear transformation used in this work to model the mapping from the articulatory space onto the acoustic space. In spite of these limitations, the model successfully extracted two acoustic variables, namely $h_1$ and $h_2$, which depend approximately linearly on two, and only two, articulatory variables, namely $\gamma_1$ and $\gamma_2$. The remaining articulatory components, $\gamma_3$, $\gamma_4$, and $\gamma_5$, have little influence on $h_1$ and $h_2$. Moreover, $\gamma_2$ has little effect on $h_1$, and the influence of $\gamma_1$ on $h_2$ does not affect the one-to-one relationship between $\gamma_2$ and $h_2$. These facts are illustrated in Fig. 3.13.
Once the parametric model is derived, and its basic characteristics are analyzed, it is interesting to compare articulatory and acoustic component trajectories for a given sequence of vocal-tract shapes. The trajectories associated with the French sentence "Ma chemise est roussie" are shown in Fig. 3.14. It is possible to observe that the first two articulatory components are indeed closely related to the first two acoustic components. It is also possible to see that there exist some similarities between $h_3$ and $h_1$, indicating that they are not independent.

3.3 Temporal Analysis


Up to this point, each of the P = 519 log-area vectors present in the corpus was parametrized by N (N = 5 in the statistical analysis, and N = 9 in the Fourier analysis) coefficients, which also contain vocal-tract length information. In this section,
Figure 3.12: Scatterings representing the joint distributions of the components of the acoustic variable h and the components of the articulatory variable $\gamma$. Note the high correlation between $\gamma_1$ and $h_1$, and between $\gamma_2$ and $h_2$. See also the nonlinear relation between $\gamma_1$ and $h_3$.

Figure 3.13: First two acoustic components ($h_1$ and $h_2$) expressed as functions of the first two articulatory components ($\gamma_1$ and $\gamma_2$), when all other components ($\gamma_3$, $\gamma_4$, and $\gamma_5$) are equal to zero. Note that $h_1$ is almost independent of $\gamma_2$, and that there are one-to-one relationships between $h_1$ and $\gamma_1$, and between $h_2$ and $\gamma_2$.

Figure 3.14: Articulatory and acoustic component trajectories along the sentence (in French): Ma chemise est roussie. Note the similarity between the first two articulatory trajectories and the first two acoustic trajectories. (The dashed lines in the acoustic trajectories indicate the intervals where the formants cannot be reliably extracted from the speech signal due to very narrow constrictions in the area function.)

the objective is to take sequences of p (e.g. p = 10) frames contained in a sentence, and represent them with fewer than pN parameters. The procedure below is carried out for sequences of $\gamma$ parameters, but the same method can be applied to a (Fourier) parameters as well. The parametrization is as follows. Representing sequences of p frames by
$$\Gamma_i = [\gamma_i, \gamma_{i+1}, \ldots, \gamma_{i+p-1}], \tag{3.39}$$
it is possible to compute the "covariance"⁷
$$C_\Gamma = \frac{1}{P'-1} \sum_{i=1}^{P'} \Gamma_i^t\, \Gamma_i, \tag{3.40}$$
where $P'$ is the number of sequences of length p contained in the corpus. Then, following the same method used for log-area statistical parametrization, it is possible to express $C_\Gamma$ as (Takagi's factorization)
$$C_\Gamma = V \Lambda V^t, \tag{3.41}$$
where $\Lambda$ is a diagonal matrix containing the eigenvalues of $C_\Gamma$ in decreasing order, and the columns of V are the corresponding normalized eigenvectors. $\Gamma_i$ can then be approximated by
$$\Gamma_i \simeq \Psi_i V_\ast^t, \tag{3.42}$$
with $\Psi_i$ given by
$$\Psi_i = \Gamma_i V_\ast, \tag{3.43}$$
where $V_\ast$ is the matrix containing the first q columns of V (i.e. the normalized eigenvectors corresponding to the q largest eigenvalues of $C_\Gamma$). The components of $\Psi_i$ are orthogonal in the sense that
$$E[\Psi_i^t \Psi_i] = E[V_\ast^t\, \Gamma_i^t \Gamma_i\, V_\ast] = V_\ast^t\, C_\Gamma\, V_\ast = \Lambda_q,$$
$E[\cdot]$ denoting expected value, and $\Lambda_q$ being the diagonal matrix containing the q largest eigenvalues of $C_\Gamma$ in decreasing order.
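The construction of the frame sequences and of the p × p "covariance" can be sketched as follows. This is a minimal illustration that assumes each sequence is stored as an N × p array whose columns are consecutive frames; the function names and the toy trajectory are hypothetical, and in a real corpus the windows would have to stay within sentence boundaries.

```python
def windows(frames, p):
    """All sequences of p consecutive frames; each window is N x p (columns = frames)."""
    n_params = len(frames[0])
    return [[[frames[i + j][n] for j in range(p)] for n in range(n_params)]
            for i in range(len(frames) - p + 1)]

def temporal_covariance(seqs):
    """C = 1/(P' - 1) * sum_i Gamma_i^t Gamma_i, a p x p matrix."""
    p = len(seqs[0][0])
    C = [[0.0] * p for _ in range(p)]
    for gamma_i in seqs:
        for row in gamma_i:  # one articulatory component followed over p frames
            for a in range(p):
                for b in range(p):
                    C[a][b] += row[a] * row[b]
    scale = 1.0 / (len(seqs) - 1)
    return [[scale * v for v in r] for r in C]

# Toy trajectory (hypothetical): 6 frames of N = 2 parameters, windows of p = 3.
frames = [[0.0, 1.0], [1.0, 0.5], [2.0, 0.0],
          [1.0, -0.5], [0.0, -1.0], [-1.0, -0.5]]
seqs = windows(frames, p=3)
C = temporal_covariance(seqs)
print(len(seqs), len(C), len(C[0]))
```

An eigenvalue decomposition of `C` would then give the temporal basis; it is omitted here to keep the sketch free of external libraries.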
Thus, a sequence of p log-area vectors
$$Y_i = \begin{bmatrix} y_{1,i} & \cdots & y_{1,i+p-1} \\ \vdots & & \vdots \\ y_{K+1,i} & \cdots & y_{K+1,i+p-1} \end{bmatrix} \tag{3.44}$$
containing p(K+1) elements (e.g. p(K+1) = 10 × 33 = 330) can be approximately represented by a matrix of coefficients
$$\Psi_i = \begin{bmatrix} \psi_{11i} & \cdots & \psi_{1qi} \\ \vdots & & \vdots \\ \psi_{N1i} & \cdots & \psi_{Nqi} \end{bmatrix} \tag{3.45}$$
containing qN elements (e.g. qN = 4 × 5 = 20), with
$$\Psi_i = T_{y\gamma}(Y_i - \mu'_y)\, V_\ast, \tag{3.46}$$
$$Y_i \simeq T_{\gamma y}\, \Psi_i\, V_\ast^t + \mu'_y. \tag{3.47}$$
The eigenvalues and first four normalized eigenvectors of the "covariance" matrix $C_\Gamma$ are shown in Fig. 3.15. It can be seen that the eigenvectors have approximately the shape of cosine functions. This indicates that a Fourier series expansion is also appropriate to represent the temporal behavior of parametrized log-area sequences (Yehia and Itakura, 1993a, 1994).
⁷ "Covariance" is quoted because $\{\Gamma_i,\ i = 1, \ldots, P'\}$ is an ensemble of matrices, and not of vectors as it should be to define a classic covariance matrix. Nevertheless, $C_\Gamma$ contains information about the covariance between components that parametrize the log-area function at different times.
The representation of an area function sequence by the method described here is illustrated in Fig. 3.16. Panel (a) shows the original sequence taken from the corpus, while panel (d) shows the corresponding first three formant trajectories. Panel (b) shows the sequence recovered from a parametrization by N × q = 5 × 4 = 20 coefficients obtained with the principal component analysis (PCA) described in this and in the previous section. Finally, panel (c) shows the sequence seen in panel (a) recovered from a two-dimensional Fourier cosine series approximation by N × q = 9 × 4 = 36 coefficients. Note the good agreement between the formant trajectories associated with the recovered area function sequences (panels (e) and (f)). Not surprisingly, also note that, even using considerably fewer parameters, PCA better preserves the morphological characteristics of the vocal-tract.

Comment
Now, the vocal-tract is represented by appropriate parameters. The next task is to
estimate these parameters from formant log-frequencies. A procedure to accomplish
this task is the topic of the next chapter.
Figure 3.15: Eigenvalues of the "covariance" matrix $C_\Gamma$ of sequences of parametrized log-area vectors, and corresponding first four normalized eigenvectors (plotted against frame number, 1 to 10).

Figure 3.16: (a) Sequence of area functions, taken from the corpus, corresponding to the diphthong /ui/, uttered in the (French) sentence "Luis pense à ça." (b) Sequence of areas reconstructed from the parametric principal component representation of the original areas shown in (a). (c) Sequence of areas reconstructed from the parametric Fourier representation of the original areas shown in (a). (d), (e) and (f) show formant frequency trajectories corresponding to the sequences of areas shown in (a), (b) and (c) respectively. The dashed lines shown in (e) and (f) are the original formant trajectories shown in (d). For each pair of formant trajectories, the maximum relative difference (in percentage) is also shown: for PCA (e), F1 7%, F2 9.8%, F3 4.7%; for Fourier (f), F1 8%, F2 9.9%, F3 11%.
Chapter 4
The Inverse Problem
"Tangible things
become insensible
to the palm of the hand."
Carlos Drummond de Andrade (1902–1987)
Memory

Before entering into the details of the speech production inverse problem, it is interesting to analyze it from a more generic point of view: the articulatory space can be viewed as a space whose dimension N is larger than the dimension M of the acoustic space. The problem is then to find, among all the points in the articulatory space that are mapped onto a given point in the acoustic space, the one that is the most likely to occur. If such points define an articulatory subspace of dimension N − M (see Figure 4.1), the solution for the inverse problem can be divided in two parts: (i) mathematical description of the (N − M)-dimensional articulatory subspaces, each of them corresponding to one and only one point in the M-dimensional acoustic space; and (ii) formulation of a cost function to determine, for each subspace, the point that is the most likely to occur. This is the procedure that will be described in this chapter.
Figure 4.1: Representation of the one-to-one relationship between the (N − M)-dimensional subspaces that form the N-dimensional articulatory space and the points of the M-dimensional acoustic space. Compare with Figures 4.2 and 4.3, where the level curves are N − M = 1 dimensional subspaces (contained in an N = 2 dimensional articulatory space) which map onto an M = 1 dimensional acoustic space.

The speech production inverse problem, i.e. the problem of estimating the vocal-tract configuration from the speech signal, can be seen as a one-to-many non-linear mapping. This mapping establishes the relationship between an articulatory space, determined by all possible vocal-tract configurations, and an acoustic space, determined by all possible speech signals.
The one-to-many characteristic comes from the fact that a given period of speech can be generated by an infinite number of vocal-tract configurations.
The non-linear characteristic is inherent in the process of speech generation. However, its degree of complexity depends on the parameters chosen to represent both articulatory and acoustic spaces.
The objective of this chapter is to analyze a restricted case of the speech pro-
duction inverse problem, namely the estimation of the cross-sectional area along the
vocal-tract (or, for simplicity, the area function) from the corresponding formant fre-
quencies. For this particular case, each area function is represented by one point in
the articulatory space, while each set of formant frequencies is represented by one
point in the acoustic space.
The first difficulty in solving this problem comes from the one-to-many characteristic of the inverse problem, i.e. each point in the acoustic space is associated with many points in the articulatory space. To cope with this fact, two kinds of constraints can be invoked: the first one is related to the morphology of the vocal-tract, which determines the positions that can be reached and the effort necessary to reach them. Under this constraint, the inverse problem can be stated in the following way: Find, among all points in the articulatory space associated with a given point in the acoustic space, the point that is reached with minimum effort¹ by the vocal-tract.
Morphological constraints, however, are essentially static and, therefore, cannot account for co-articulation effects such as anticipation and retention. In order to cope with these effects, a second kind of constraint can be invoked: it is related to the patterns of motion of the vocal-tract, which can be called gestures, and can be used to determine the trajectories that can be followed by the tract, and the effort necessary to execute each of them. At this point, it is possible to expand the concept of an articulatory space, and think about an articulatory trajectory space which is formed by all the gestures that can be generated by the human vocal-tract. Such a space maps onto an acoustic trajectory space which is formed by all trajectories that can be generated by the vocal-tract in the acoustic space. From this point of view, the inverse problem can be restated as: Find, among all points in the articulatory trajectory space associated with a given point in the acoustic trajectory space, the point that is produced with minimum effort by the vocal-tract.
¹ In the case of human motor behavior, minimum effort is a concept difficult to specify. In reality it is a combination of factors ranging from the command generation level in the brain down to articulatory motion under physiological constraints. The effort function used here is a simple quadratic cost that takes into account vocal-tract morphological information.

4.1 Isolated frames


In this section, the mathematical formulation used to represent morphological con-
straints is described for the case of isolated frames. In the following section the
method is generalized for the case of trajectories.
The procedure can be divided in two parts: first, a mathematical representation for the "cost" of a given position of the vocal-tract is derived. After that, this cost function is minimized under the acoustic constraint determined by a given set of formant frequencies.

4.1.1 Representing Morphological Constraints


In order to keep a link with the works developed by Schroeder (1967) and Mermelstein (1967), we start the explanation of morphological constraint representation using the truncated Fourier cosine series parametrization of the log-area function. There are many sets of Fourier cosine coefficients that are associated with the same set of formant frequencies (Mermelstein, 1967; Atal et al., 1978). For this reason, it is necessary to impose constraints if we wish to estimate the area function from the formants. As an example, the thick solid lines shown in the bottom panel of Figure 4.2 represent level curves of the surface shown in the top left panel, which shows the first formant frequency (F1 in kHz) when the vocal-tract cross-sectional area is represented by
$$A_k = \exp\left[a_1 \cos\frac{\pi}{K}\left(k - \tfrac{1}{2}\right) + a_2 \cos\frac{2\pi}{K}\left(k - \tfrac{1}{2}\right)\right], \qquad L = 17\ \text{cm}. \tag{4.1}$$
It is seen that, for a given value of F1, there is an infinite number of combinations of $a_1$ and $a_2$ associated with it. Another important fact is that, for a given $a_2$, there is one and only one $a_1$ associated with a given formant frequency.
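The two-coefficient area function of Eq. (4.1) is easy to evaluate numerically. The sketch below assumes the cosine arguments $\frac{\pi}{K}(k - \frac{1}{2})$ and $\frac{2\pi}{K}(k - \frac{1}{2})$ with sections k = 1, ..., K and K = 32; the function name is hypothetical, and computing F1 itself would additionally require the wave propagation model, which is not reproduced here.

```python
import math

def area_function(a1, a2, K=32):
    """Areas A_k, k = 1..K, from a two-term cosine series of the log-area."""
    return [math.exp(a1 * math.cos(math.pi / K * (k - 0.5))
                     + a2 * math.cos(2 * math.pi / K * (k - 0.5)))
            for k in range(1, K + 1)]

uniform = area_function(0.0, 0.0)  # both coefficients zero: uniform tube
flared = area_function(-1.0, 0.0)  # a1 < 0: narrow near k = 1, wide near k = K
print(uniform[0], flared[0], flared[-1])
```

Because the first cosine term is antisymmetric about the middle of the tract, a negative $a_1$ narrows one end by the same factor by which it widens the other.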
The same property is observed when $\gamma$ and h are used to parametrize articulatory and acoustic spaces. This fact is illustrated in the bottom panel of Figure 4.3, where the thick solid lines show level curves of $h_1$ expressed as a function of $\gamma_1$ and $\gamma_2$ when all other components of $\gamma$ are equal to zero. It is seen that each point $h_1$ is associated with a line in the plane defined by $\gamma_1$ and $\gamma_2$.
From now on, since $\gamma$ and h allow a more efficient parametrization than a and f, most of the mathematical procedures carried out here will be based on $\gamma$ and h.
Figure 4.2: Top left: the first formant F1 as a function of the Fourier cosine coefficients $a_1$ and $a_2$, when all other coefficients are equal to zero. Top right: paraboloidal surface representing the cost function $P_a = a^t H_a a$ used to quantify the vocal-tract effort. Bottom: The solid thick lines show level curves of the surface shown in the top left panel. (Compare with the general case in Fig. 4.1.) The solid thin ellipses show level curves of the cost function shown in the top right panel. The dashed circles represent the particular case when $P_y$ is an unweighted squared Euclidean distance (i.e. when $H_y$ is an identity matrix.)

Figure 4.3: Top left: the first acoustic component $h_1$ as a function of the principal component coefficients $\gamma_1$ and $\gamma_2$, when all other coefficients are equal to zero. Top right: paraboloidal surface representing the cost function $P_\gamma = \gamma^t H_\gamma \gamma$ used to quantify the vocal-tract effort. Bottom: The solid thick lines show level curves of the surface shown in the top left panel. (Compare with the general case in Fig. 4.1.) The solid thin ellipses show level curves of the cost function shown in the top right panel. The dashed ellipses represent the particular case when $P_y$ is an unweighted squared Euclidean distance (i.e. when $H_y$ is an identity matrix.)

Nevertheless, the counterpart procedures based on a and f can be obtained basically by replacing $\gamma$ by a and h by f.
Coming back to the topic of mathematical representation of morphological constraints, the (static) constraint considered here is based on the following optimization problem: Given a vector of acoustic variables h, find, among all possible vectors of articulatory variables $\gamma$ associated with h, the one that is the "closest" to the "minimum effort position." This position can be given by the neutral or the average position of the vocal-tract: it is reasonable to assume that the neutral vowel position corresponds to the minimum effort position of the tract, since no active articulation is being performed. It is also possible to think that the minimum effort position corresponds to the average position of the tract. If the neutral position is mathematically interpreted as the point of maximum probability density in the articulatory space, then it coincides with the average position if the probability density function of the points in the articulatory space is symmetric, but may not coincide in other cases. The neutral position seems to be more meaningful, but the average position is more tractable from the mathematical point of view. This point will be addressed again later.
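A mathematical formulation for a morphological constraint can be carried out in the following way: a log-area vector $\mathbf{y}$ can be efficiently parametrized by a vector $\boldsymbol{\gamma}$, as already seen in the last chapter (Eq. (3.29)):
$$\mathbf{y} \simeq T_y \boldsymbol{\gamma} + \mathbf{y}_0.$$
If the "minimum effort position" is defined by $\mathbf{y}_0$, then a quadratic positional cost $P_\gamma$ can be defined as
$$P_\gamma = \boldsymbol{\gamma}^t T_y^t H_y T_y \boldsymbol{\gamma}, \tag{4.2}$$
which is an approximation for the quadratic form
$$P_y = (\mathbf{y} - \mathbf{y}_0)^t H_y (\mathbf{y} - \mathbf{y}_0). \tag{4.3}$$
Now, making
$$H_\gamma = T_y^t H_y T_y \tag{4.4}$$
yields
$$P_\gamma = \boldsymbol{\gamma}^t H_\gamma \boldsymbol{\gamma}. \tag{4.5}$$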
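The step from Eq. (4.3) to Eq. (4.5) can be checked numerically. The sketch below is only an illustration: the sizes, the random stand-in for $T_y$ and the diagonal $H_y$ are assumptions made for the example, not values from the corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: K + 1 log-area samples, N articulatory coefficients.
K1, N = 8, 3
T_y = rng.standard_normal((K1, N))         # stand-in for the parametrization matrix
y_0 = rng.standard_normal(K1)              # stand-in for the minimum effort position
H_y = np.diag(rng.uniform(0.5, 2.0, K1))   # assumed positive weights per tract section

# Eq. (4.4): weight matrix expressed in the coefficient domain.
H_gamma = T_y.T @ H_y @ T_y

def positional_cost(gamma):
    """Eq. (4.5): quadratic positional cost of a coefficient vector."""
    return float(gamma @ H_gamma @ gamma)

# For a log-area vector that is exactly representable, Eq. (4.3) gives the same value.
gamma = rng.standard_normal(N)
y = T_y @ gamma + y_0
cost_y = float((y - y_0) @ H_y @ (y - y_0))
```

When $\mathbf{y}$ is not exactly representable by $\boldsymbol{\gamma}$, the two costs differ by the parametrization error, which is why Eq. (4.2) is described as an approximation.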
$H_\gamma$ is a positive definite matrix which contains information about the morphology of the vocal-tract. It must be chosen so that natural positions of the tract result in a low cost $P_\gamma$, while positions incompatible with the morphological characteristics of the tract result in a high cost $P_\gamma$.
The simplest choice for $H_\gamma$ is obtained when $H_y$ is taken as the identity matrix of order $(K+1)$. In this case, $P_\gamma$ is simply the square of the Euclidean distance between the parametrized log-areas of a given vector $\mathbf{y}$ and $\mathbf{y}_0$, the log-area vector corresponding to the minimum effort position. This is, however, not a good cost function, since it gives equal weight to flexible and rigid regions of the vocal-tract. A geometric illustration of this case is given by the dashed ellipses shown in the bottom panel of Fig. 4.3. However, the meaning of this unweighted case of the cost function becomes clearer when it is represented by Fourier cosine coefficients: the dashed circles shown in the bottom panel of Fig. 4.2 represent level curves of $P_a$ as a function of $a_1$ and $a_2$. It is possible to see that the minimal cost for a given formant corresponds to the intersection of its level curve with the line $a_2 = 0$. It is interesting to note that Mermelstein (1967) used basically the same mathematical constraint: for $N = 7$ and $M = 3$, $\{a_0, a_2, a_4, a_6\}$ were kept equal to zero, while the first $M = 3$ formant frequencies $\{F_1, F_2, F_3\}$ were used to determine the first $M = 3$ odd Fourier cosine coefficients $\{a_1, a_3, a_5\}$. (The problem of length determination was not considered.) Also interesting is the fact that the results obtained with this simple and rather artificial constraint are quite acceptable for some vowels, as shown in Mermelstein (1967), and in the next chapter.
A more realistic possibility is obtained when $H_y$ is taken as a diagonal matrix in which the elements of the diagonal associated with the rigid regions of the tract are large, while the elements associated with the flexible regions are small. This approach, combined with a smoothness constraint, was used by Yehia and Itakura (1993, 1994) with reasonable results. The weak point in this approach is that, although the local characteristics of each region of the tract are well represented, the global articulatory structure (morphology) of the tract is not taken into account. As an example, the cost of a given position of the tongue apical region may depend on the position of the tongue dorsal region.
In order to represent the interdependence between different regions of the tract, the covariance between those regions must be taken into account. From a probabilistic point of view, given a corpus containing $Q > N$ linearly independent log-area vectors, if $H_\gamma$ is taken as the inverse of the covariance matrix $C_\gamma$ of the log-area articulatory vectors $\boldsymbol{\gamma}$,
$$H_\gamma = C_\gamma^{-1}, \quad \text{where} \quad C_\gamma \simeq \frac{1}{Q-1} \sum_{i=1}^{Q} \boldsymbol{\gamma}_i \boldsymbol{\gamma}_i^t, \tag{4.6}$$
then
$$P_\gamma = \boldsymbol{\gamma}^t C_\gamma^{-1} \boldsymbol{\gamma}, \tag{4.7}$$
i.e. $P_\gamma$ becomes a squared Mahalanobis distance (Duda and Hart, 1973, pp. 23-24). Under the rather strong assumption of normal distribution, it means that, given an acoustic vector $\mathbf{h}$, minimization of $P_\gamma$ implies maximization of the probability of occurrence of the corresponding $\boldsymbol{\gamma}$.
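A small numerical sketch may help to see what the Mahalanobis form of Eqs. (4.6) and (4.7) does. The two-dimensional corpus below is synthetic and purely illustrative (the mixing matrix `A` is an arbitrary assumption); it only shows that a displacement along the direction in which the corpus varies most is cheaper than a displacement of the same length across it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, zero-mean "corpus" of Q coefficient vectors with correlated components.
Q, N = 500, 2
A = np.array([[2.0, 0.0], [1.5, 0.5]])     # arbitrary mixing matrix (assumption)
corpus = rng.standard_normal((Q, N)) @ A.T
corpus -= corpus.mean(axis=0)

# Eq. (4.6): sample covariance and its inverse.
C_gamma = corpus.T @ corpus / (Q - 1)
H_gamma = np.linalg.inv(C_gamma)

def mahalanobis_sq(gamma):
    """Eq. (4.7): squared Mahalanobis distance from the origin."""
    return float(gamma @ H_gamma @ gamma)

# Principal axes of the level ellipses: eigenvectors of C_gamma (eigenvalues ascending).
w, U = np.linalg.eigh(C_gamma)
minor_axis, major_axis = U[:, 0], U[:, -1]

cost_along = mahalanobis_sq(major_axis)    # unit step along the most variable direction
cost_across = mahalanobis_sq(minor_axis)   # unit step along the least variable direction
mean_cost = np.mean([mahalanobis_sq(g) for g in corpus])
```

With this choice of $H_\gamma$ the cost of a unit step along a principal axis is the reciprocal of the corresponding eigenvalue of $C_\gamma$, so the level curves are the ellipses described in the text.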
It is interesting to note that the same $H_\gamma$ can be found by minimizing
$$C_P = \frac{1}{Q} \sum_{i=1}^{Q} P_{\gamma_i} = \frac{1}{Q} \sum_{i=1}^{Q} \boldsymbol{\gamma}_i^t H_\gamma \boldsymbol{\gamma}_i = \operatorname{trace}(C_\gamma H_\gamma), \tag{4.8}$$
the average of the costs of all articulatory vectors in the corpus, with respect to the elements of $H_\gamma$, under the constraint
$$|H_\gamma| \geq |C_\gamma^{-1}|. \tag{4.9}$$
The proof is based on the fact that the trace and the determinant are respectively the sum and the product of the eigenvalues of $C_\gamma H_\gamma$. Then it is not difficult to show that minimization of the trace under the above determinant constraint implies that the eigenvalues must be all equal to 1 (one) and, hence, $C_\gamma H_\gamma$ is an identity matrix. Since $C_\gamma$ is a covariance matrix,
$$H_\gamma = C_\gamma^{-1}$$
is defined, and is a positive definite matrix.
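The eigenvalue argument can be written out in a few lines. The following is a sketch of the step left to the reader, with $\lambda_1, \ldots, \lambda_N$ introduced here as notation for the eigenvalues of $C_\gamma H_\gamma$; they are positive because $C_\gamma H_\gamma$ is similar to the positive definite matrix $C_\gamma^{1/2} H_\gamma C_\gamma^{1/2}$.

```latex
\begin{align*}
\operatorname{trace}(C_\gamma H_\gamma)
  &= \sum_{k=1}^{N} \lambda_k
   \;\geq\; N \Bigl( \prod_{k=1}^{N} \lambda_k \Bigr)^{1/N}
   && \text{(arithmetic-geometric mean inequality)} \\
  &= N \bigl( |C_\gamma| \, |H_\gamma| \bigr)^{1/N}
   \;\geq\; N
   && \text{(constraint (4.9): } |H_\gamma| \geq |C_\gamma^{-1}| \text{)} .
\end{align*}
% Both inequalities become equalities only if all \lambda_k are equal and their
% product is one, i.e. \lambda_k = 1 for every k. The symmetric matrix
% C_\gamma^{1/2} H_\gamma C_\gamma^{1/2} then has all eigenvalues equal to one,
% so it is the identity, which gives H_\gamma = C_\gamma^{-1}.
```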


From a geometrical point of view, the level hypersurfaces of $P_\gamma$ are hyperellipsoids whose principal axes are determined by the eigenvectors of $C_\gamma$, the eigenvalues determining the length of these axes. An illustration for a two-dimensional case is given by the ellipses shown in the bottom panel of Fig. 4.3, which represent level curves of $P_\gamma$ as a function of $\gamma_1$ and $\gamma_2$.
The cost function derived above is able to cope with the articulatory effects determined by the morphology of the vocal-tract. However, it is important to note that, in a wider sense, it cannot be considered optimal. This is because the quadratic form adopted for $P_\gamma$ implies the existence of a single point of minimum, corresponding to the minimum effort position. Nevertheless, other stable positions, corresponding to local minima, may exist and, if they are to be taken into account, a more elaborate model is needed. Also, the probability distribution of points in the articulatory space is not symmetric about the mean. When solving the inverse problem, it was observed that slightly better results are obtained when the quadratic cost function has its center translated to the point of maximum probability density in the articulatory space (i.e. the neutral position). Finally, since $P_\gamma$ is essentially a static constraint, it cannot cope with co-articulation effects. (This point will be analyzed later.)
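As a final comment note that, for the case of Fourier cosine components, following the same procedure used for the principal components, the cost function is given by
$$P_a = (\mathbf{a} - \bar{\mathbf{a}})^t H_a (\mathbf{a} - \bar{\mathbf{a}}), \qquad H_a = C_a^{-1}, \qquad C_a = \frac{1}{Q-1} \sum_{i=1}^{Q} (\mathbf{a}_i - \bar{\mathbf{a}})(\mathbf{a}_i - \bar{\mathbf{a}})^t. \tag{4.10}$$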

4.1.2 Solving the inverse problem


For a given acoustic vector $\mathbf{h}$, it is now possible to derive a procedure to estimate its articulatory counterpart, represented by vector $\boldsymbol{\gamma}$, under the morphological constraint described above. The method is as follows.

Relationship between acoustic and articulatory variables


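A variation $\Delta\mathbf{h}$ in the acoustic vector $\mathbf{h}$ is locally linearly related to a variation $\Delta\boldsymbol{\gamma}$ in the articulatory vector $\boldsymbol{\gamma}$ (Mermelstein, 1967). Thus, it is reasonable to assume that, for sufficiently small variations,$^{2}$
$$\Phi_a(\boldsymbol{\gamma}) \, \Delta\boldsymbol{\gamma} = \Delta\mathbf{h}, \tag{4.11}$$
$$\mathbf{h}(\boldsymbol{\gamma} + \Delta\boldsymbol{\gamma}) = \mathbf{h}(\boldsymbol{\gamma}) + \Phi_a(\boldsymbol{\gamma}) \, \Delta\boldsymbol{\gamma}, \tag{4.12}$$
where $\Phi_a$ is the Jacobian matrix that gives the partial derivatives $\partial h_j / \partial \gamma_i$ for a given $\boldsymbol{\gamma}$:
$$\Phi_a(\boldsymbol{\gamma}) = \frac{d\mathbf{h}}{d\boldsymbol{\gamma}}, \tag{4.13}$$
which is an $M \times N$ matrix (the number of rows is equal to the number of acoustic components $M$, and the number of columns is equal to the number of articulatory components $N$). Figure 3.12 gives an idea of the "degree of linearity" between $\mathbf{h}$ and $\boldsymbol{\gamma}$. It shows the $M = 3$ acoustic variables $\mathbf{h}$ in terms of the $N = 5$ articulatory variables $\gamma_1, \gamma_2, \ldots, \gamma_5$.

2 In strict terms, the equality holds only for infinitesimal variations. However, this relation is approximately true even for fairly large variations, as seen in Fig. 3.12.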
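When no analytical expression for the Jacobian is at hand, it can be estimated by finite differences. The sketch below is an illustration only: `h_toy` is a hypothetical, mildly non-linear stand-in for the articulatory-to-acoustic map (it is not the vocal-tract model used in this study), and central differences are one possible choice of scheme.

```python
import numpy as np

def h_toy(gamma):
    """Hypothetical map from N = 5 articulatory to M = 3 acoustic variables."""
    g = np.asarray(gamma, dtype=float)
    return np.array([
        g[0] + 0.1 * np.tanh(g[3]),
        g[1] + 0.1 * g[0] * g[4],
        g[2] + 0.05 * g[3] ** 2,
    ])

def jacobian_fd(h, gamma, step=1e-6):
    """Central-difference estimate of the M x N Jacobian of Eq. (4.13)."""
    gamma = np.asarray(gamma, dtype=float)
    h0 = h(gamma)
    J = np.empty((h0.size, gamma.size))
    for i in range(gamma.size):
        d = np.zeros_like(gamma)
        d[i] = step
        J[:, i] = (h(gamma + d) - h(gamma - d)) / (2.0 * step)
    return J

gamma = np.array([0.3, -0.2, 0.1, 0.4, -0.5])
Phi_a = jacobian_fd(h_toy, gamma)

# Eq. (4.12): first-order prediction of the acoustic vector after a small variation.
dgamma = 1e-3 * np.ones(5)
lin_err = np.max(np.abs(h_toy(gamma + dgamma) - (h_toy(gamma) + Phi_a @ dgamma)))
```

The residual `lin_err` is of second order in the size of the variation, which is the sense in which Eq. (4.12) holds only approximately for finite steps.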
There are two important facts to be noted here:$^{3}$
1. There is a quasi-linear relationship between $\Delta\mathbf{h}$ and $\Delta\boldsymbol{\gamma}$.
2. There is a one-to-one relationship between $[\gamma_1, \gamma_2, \gamma_3]$ and $[h_1, h_2, h_3]$. (In fact, what is apparent in Figure 3.12 is that $h_1$, $h_2$ and $h_3$ are monotonically increasing functions of $\gamma_1$, $\gamma_2$ and $\gamma_3$, respectively.)
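3 Similar conclusions can be drawn from Figure 3.3 for the case of Fourier cosine coefficients and formant frequencies.

This one-to-one relationship leads to the following speculation: taking $\boldsymbol{\gamma}_1$ as the vector formed by the first $M$ articulatory components of $\boldsymbol{\gamma}$, the matrix
$$\Phi_1(\boldsymbol{\gamma}) = \frac{\partial \mathbf{h}}{\partial \boldsymbol{\gamma}_1} =
\begin{bmatrix}
\dfrac{\partial h_1}{\partial \gamma_1} & \dfrac{\partial h_1}{\partial \gamma_2} & \cdots & \dfrac{\partial h_1}{\partial \gamma_M} \\
\dfrac{\partial h_2}{\partial \gamma_1} & \dfrac{\partial h_2}{\partial \gamma_2} & \cdots & \dfrac{\partial h_2}{\partial \gamma_M} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial h_M}{\partial \gamma_1} & \dfrac{\partial h_M}{\partial \gamma_2} & \cdots & \dfrac{\partial h_M}{\partial \gamma_M}
\end{bmatrix} \tag{4.14}$$
is not singular. In fact, numerical tests with log-area functions indicate that $\det(\Phi_1)$ is practically always positive. Exceptions do exist, but did not cause problems in the cases analyzed so far.
Therefore, under the assumption that $\det(\Phi_1) \neq 0$, it is possible to divide
$$\Delta\boldsymbol{\gamma} = [\Delta\gamma_1, \Delta\gamma_2, \ldots, \Delta\gamma_M, \Delta\gamma_{M+1}, \ldots, \Delta\gamma_N]^t \tag{4.15}$$
into two subvectors: $\Delta\boldsymbol{\gamma}_1$, containing the first $M$ components of $\Delta\boldsymbol{\gamma}$, and $\Delta\boldsymbol{\gamma}_2$, containing the remaining components,
$$\Delta\boldsymbol{\gamma}_1 = [\Delta\gamma_1, \Delta\gamma_2, \ldots, \Delta\gamma_M]^t, \tag{4.16}$$
$$\Delta\boldsymbol{\gamma}_2 = [\Delta\gamma_{M+1}, \Delta\gamma_{M+2}, \ldots, \Delta\gamma_N]^t, \tag{4.17}$$
where $M$ is the number of acoustic components; and express $\Delta\boldsymbol{\gamma}_1$ in terms of $\Delta\mathbf{h}$ and $\Delta\boldsymbol{\gamma}_2$ as follows:
$$\Phi_a \, \Delta\boldsymbol{\gamma} = \Delta\mathbf{h}, \tag{4.18}$$
$$\Phi_1 \, \Delta\boldsymbol{\gamma}_1 + \Phi_2 \, \Delta\boldsymbol{\gamma}_2 = \Delta\mathbf{h}, \tag{4.19}$$
$$\Delta\boldsymbol{\gamma}_1 = \Phi_1^{-1} \Delta\mathbf{h} - \Phi_1^{-1} \Phi_2 \, \Delta\boldsymbol{\gamma}_2, \tag{4.20}$$
where $\Phi_1$ was already defined in Eq. (4.14), and
$$\Phi_2(\boldsymbol{\gamma}) = \frac{\partial \mathbf{h}}{\partial \boldsymbol{\gamma}_2} =
\begin{bmatrix}
\dfrac{\partial h_1}{\partial \gamma_{M+1}} & \dfrac{\partial h_1}{\partial \gamma_{M+2}} & \cdots & \dfrac{\partial h_1}{\partial \gamma_N} \\
\dfrac{\partial h_2}{\partial \gamma_{M+1}} & \dfrac{\partial h_2}{\partial \gamma_{M+2}} & \cdots & \dfrac{\partial h_2}{\partial \gamma_N} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial h_M}{\partial \gamma_{M+1}} & \dfrac{\partial h_M}{\partial \gamma_{M+2}} & \cdots & \dfrac{\partial h_M}{\partial \gamma_N}
\end{bmatrix}, \tag{4.21}$$
where $\boldsymbol{\gamma}_2$ is given by the last $N - M$ components of $\boldsymbol{\gamma}$. (The equations above are in general terms. In the case under analysis, $M = 3$ and $N = 5$.)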
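Equation (4.20) says that, once the $N - M$ free components are chosen, the remaining $M$ components are fixed by the acoustic constraint. A minimal numerical sketch, assuming an arbitrary (randomly generated) Jacobian with a well-conditioned leading block, is:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 3, 5

# Arbitrary stand-in for the M x N Jacobian; the leading M x M block is near identity.
Phi_a = np.hstack([np.eye(M) + 0.1 * rng.standard_normal((M, M)),
                   0.3 * rng.standard_normal((M, N - M))])
Phi_1, Phi_2 = Phi_a[:, :M], Phi_a[:, M:]    # blocks of Eqs. (4.14) and (4.21)

dh = np.array([0.2, -0.1, 0.05])             # desired acoustic variation (assumption)

def dgamma_from_free(dgamma_2):
    """Eq. (4.20): complete a choice of the free components so that Eq. (4.18) holds."""
    dgamma_1 = np.linalg.solve(Phi_1, dh - Phi_2 @ dgamma_2)
    return np.concatenate([dgamma_1, dgamma_2])

candidate_a = dgamma_from_free(np.zeros(N - M))
candidate_b = dgamma_from_free(np.array([0.4, -0.3]))
```

Both candidates produce the same acoustic variation; the morphological cost is what selects one member of this family.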

Combining acoustic and morphological constraints


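The above relation can now be used as an acoustic constraint for the cost function $P_\gamma$. By minimizing $P_\gamma[\boldsymbol{\gamma} + \Delta\boldsymbol{\gamma}(\Delta\boldsymbol{\gamma}_2)]$ with respect to $\Delta\boldsymbol{\gamma}_2$ (using $\Delta\boldsymbol{\gamma}_1 = \Phi_1^{-1} \Delta\mathbf{h} - \Phi_1^{-1} \Phi_2 \Delta\boldsymbol{\gamma}_2$), it is possible to find the $\Delta\boldsymbol{\gamma}_{\min}$ that minimizes $P_\gamma(\boldsymbol{\gamma} + \Delta\boldsymbol{\gamma})$ under the acoustic constraint
$$\Phi_a \, \Delta\boldsymbol{\gamma} = \Delta\mathbf{h}.$$
The mathematical formulation is as follows. Given
$$P_\gamma[\Delta\boldsymbol{\gamma}(\Delta\boldsymbol{\gamma}_2)] = [\Delta\boldsymbol{\gamma}(\Delta\boldsymbol{\gamma}_2) + \boldsymbol{\gamma} - \boldsymbol{\gamma}_0]^t H_\gamma [\Delta\boldsymbol{\gamma}(\Delta\boldsymbol{\gamma}_2) + \boldsymbol{\gamma} - \boldsymbol{\gamma}_0], \tag{4.22}$$
where $\boldsymbol{\gamma}_0$ is the neutral position,$^{4}$ find the family of vectors $\Delta\boldsymbol{\gamma}$ for which
$$\frac{dP_\gamma}{d\Delta\boldsymbol{\gamma}_2} = 0. \tag{4.23}$$
Doing the calculation,
$$\frac{dP_\gamma}{d\Delta\boldsymbol{\gamma}_2} = \left[ \frac{d\Delta\boldsymbol{\gamma}}{d\Delta\boldsymbol{\gamma}_2} \right]^t \frac{dP_\gamma}{d\Delta\boldsymbol{\gamma}} = \left[ \frac{d\Delta\boldsymbol{\gamma}}{d\Delta\boldsymbol{\gamma}_2} \right]^t [H_\gamma + H_\gamma^t] (\Delta\boldsymbol{\gamma} + \boldsymbol{\gamma} - \boldsymbol{\gamma}_0). \tag{4.24}$$

4 As mentioned before, better results were obtained when the neutral position was used instead of the average position as the minimum effort position.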

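Here, the derivative of $\Delta\boldsymbol{\gamma}$ with respect to $\Delta\boldsymbol{\gamma}_2$ will be called $\Delta\boldsymbol{\gamma}'$ and is given by
$$\Delta\boldsymbol{\gamma}' = \frac{d\Delta\boldsymbol{\gamma}}{d\Delta\boldsymbol{\gamma}_2} = \begin{bmatrix} -\Phi_1^{-1} \Phi_2 \\ I_{N-M} \end{bmatrix}, \tag{4.25}$$
where $I_{N-M}$ is the identity matrix of order $N - M$. Now, since $H_\gamma$ is symmetric (it is the inverse of a covariance matrix),
$$\frac{dP_\gamma}{d\Delta\boldsymbol{\gamma}_2} = 2 \, \Delta\boldsymbol{\gamma}'^{\,t} H_\gamma (\Delta\boldsymbol{\gamma} + \boldsymbol{\gamma} - \boldsymbol{\gamma}_0) \tag{4.26}$$
and, finally, $dP_\gamma / d\Delta\boldsymbol{\gamma}_2 = 0$ implies that
$$\Delta\boldsymbol{\gamma}'^{\,t} H_\gamma \, \Delta\boldsymbol{\gamma} = \Delta\boldsymbol{\gamma}'^{\,t} H_\gamma (\boldsymbol{\gamma}_0 - \boldsymbol{\gamma}), \tag{4.27}$$
which can be rewritten as
$$\Phi_p \, \Delta\boldsymbol{\gamma} = \Phi_p (\boldsymbol{\gamma}_0 - \boldsymbol{\gamma}), \tag{4.28}$$
where
$$\Phi_p = \Delta\boldsymbol{\gamma}'^{\,t} H_\gamma. \tag{4.29}$$
The linear system above gives the necessary $N - M$ equations to complete the underdetermined system given by Eq. (4.11):
$$\left. \begin{aligned} \Phi_a \, \Delta\boldsymbol{\gamma} &= \Delta\mathbf{h} \\ \Phi_p \, \Delta\boldsymbol{\gamma} &= \Phi_p (\boldsymbol{\gamma}_0 - \boldsymbol{\gamma}) \end{aligned} \right\} \implies \Phi \, \Delta\boldsymbol{\gamma} = \Delta\tilde{\mathbf{h}} \implies \Delta\boldsymbol{\gamma} = \Phi^{-1} \Delta\tilde{\mathbf{h}}, \tag{4.30}$$
where
$$\Phi = \begin{bmatrix} \Phi_a \\ \Phi_p \end{bmatrix} \quad \text{and} \quad \Delta\tilde{\mathbf{h}} = \begin{bmatrix} \Delta\mathbf{h} \\ \Phi_p (\boldsymbol{\gamma}_0 - \boldsymbol{\gamma}) \end{bmatrix}. \tag{4.31}$$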
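The augmented system of Eqs. (4.25) to (4.31) can be assembled and solved directly. In the sketch below all matrices are arbitrary stand-ins generated for the example (assumptions, not corpus data); the final lines compare the solution with other vectors that satisfy the same acoustic constraint.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 3, 5

# Arbitrary stand-ins: Jacobian, positive definite weight matrix, current and neutral positions.
Phi_a = np.hstack([np.eye(M) + 0.1 * rng.standard_normal((M, M)),
                   0.3 * rng.standard_normal((M, N - M))])
B = rng.standard_normal((N, N))
H_gamma = B @ B.T + N * np.eye(N)
gamma = rng.standard_normal(N)
gamma_0 = np.zeros(N)
dh = np.array([0.2, -0.1, 0.05])

Phi_1, Phi_2 = Phi_a[:, :M], Phi_a[:, M:]
dgamma_prime = np.vstack([-np.linalg.solve(Phi_1, Phi_2), np.eye(N - M)])  # Eq. (4.25)
Phi_p = dgamma_prime.T @ H_gamma                                           # Eq. (4.29)
Phi = np.vstack([Phi_a, Phi_p])                                            # Eq. (4.31)
dh_aug = np.concatenate([dh, Phi_p @ (gamma_0 - gamma)])
dgamma = np.linalg.solve(Phi, dh_aug)                                      # Eq. (4.30)

def cost(d):
    """Eq. (4.22): positional cost after applying the variation d."""
    v = gamma + d - gamma_0
    return float(v @ H_gamma @ v)

# Any other solution of the acoustic constraint differs by a combination of the
# columns of dgamma_prime, and should not cost less.
other = dgamma + dgamma_prime @ np.array([0.3, -0.7])
```

The columns of `dgamma_prime` span the directions that leave the linearized acoustic vector unchanged, so `other` is acoustically equivalent to `dgamma` but has a higher positional cost.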

p
Iterative solution
For larger variations, the system above is an approximation, since $\Phi$ depends on $\boldsymbol{\gamma}$. However, as seen in Fig. 3.12, there is a high degree of linearity in this non-linear system. In the experiments performed, it was successfully solved by the following Newton-Raphson procedure:
    γ₁ = γ(h₁, H_γ)               /* Function to compute γ₁ from h₁ and prior information H_γ. */
    {
        γ = γ₀;                   /* Initialize γ to the neutral position and compute */
        h = h(γ₀);                /* the corresponding acoustic vector. */
        Δh = h₁ − h;
        while (‖Δh‖ > ε)
        {
            Φ = Φ(γ, H_γ);        /* Calculate new Φ, */
            Δh̃ = Δh̃(Δh, γ, Φ);    /* Δh̃, */
            Δγ = Φ⁻¹ Δh̃;          /* and Δγ. */
            γ = γ + Δγ;           /* Update γ. */
            h = h(γ);             /* Update h. */
            Δh = h₁ − h;
        }
        γ₁ = γ;                   /* Return γ₁. */
    }
$\|\cdot\|$ is a norm function. In the implemented system it is the maximal deviation between the desired and obtained acoustic vectors ($\Delta\mathbf{h}$). $\varepsilon$ is an error criterion. A value around 0.01 (1%) was found to be a good compromise between precision and computation cost. The input $\mathbf{h}_1$ is obtained from the first $M$ formant frequencies using Eq. (3.33), while the area vector is obtained from the output $\boldsymbol{\gamma}_1$ using Eq. (3.29). For the analyzed cases (see next chapter) this procedure took on average three iterations to converge, the hardest cases taking six iterations.
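The procedure above can be sketched in runnable form as follows. This is an illustration under stated assumptions, not the implementation used in this study: `h_toy` is a hypothetical stand-in for the articulatory-to-acoustic map, the Jacobian is estimated by central differences, the norm is taken as the maximum absolute deviation, and `max_iter` is a safeguard added for the example.

```python
import numpy as np

def h_toy(gamma):
    """Hypothetical, mildly non-linear map from N = 5 to M = 3 variables."""
    g = np.asarray(gamma, dtype=float)
    return np.array([g[0] + 0.1 * np.tanh(g[3]),
                     g[1] + 0.1 * g[0] * g[4],
                     g[2] + 0.05 * g[3] ** 2])

def jacobian_fd(h, gamma, step=1e-6):
    """Central-difference estimate of the M x N Jacobian of h at gamma."""
    h0 = h(gamma)
    J = np.empty((h0.size, gamma.size))
    for i in range(gamma.size):
        d = np.zeros_like(gamma)
        d[i] = step
        J[:, i] = (h(gamma + d) - h(gamma - d)) / (2.0 * step)
    return J

def invert_frame(h_target, h, gamma_0, H_gamma, eps=0.01, max_iter=50):
    """Iterate Eq. (4.30) from the neutral position until the acoustic error is below eps."""
    M, N = h_target.size, gamma_0.size
    gamma = gamma_0.astype(float).copy()
    dh = h_target - h(gamma)
    n_iter = 0
    while np.max(np.abs(dh)) > eps and n_iter < max_iter:
        Phi_a = jacobian_fd(h, gamma)
        Phi_1, Phi_2 = Phi_a[:, :M], Phi_a[:, M:]
        dgamma_prime = np.vstack([-np.linalg.solve(Phi_1, Phi_2), np.eye(N - M)])
        Phi_p = dgamma_prime.T @ H_gamma                      # Eq. (4.29)
        Phi = np.vstack([Phi_a, Phi_p])                       # Eq. (4.31)
        dh_aug = np.concatenate([dh, Phi_p @ (gamma_0 - gamma)])
        gamma = gamma + np.linalg.solve(Phi, dh_aug)          # Eq. (4.30) and update
        dh = h_target - h(gamma)
        n_iter += 1
    return gamma, n_iter

gamma_0 = np.zeros(5)                                 # neutral position (assumption)
H_gamma = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])          # assumed positive definite weights
gamma_true = np.array([0.8, -0.5, 0.3, 0.6, -0.4])
h_target = h_toy(gamma_true)
gamma_est, n_iter = invert_frame(h_target, h_toy, gamma_0, H_gamma, eps=1e-8)
```

The estimate reproduces the target acoustic vector but need not coincide with `gamma_true`: among the positions that map to the target, the iteration moves toward the one with the lowest cost relative to the neutral position.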
4.2 Trajectories
The log-area function moves smoothly due to dynamical constraints. Here, instead of physically modeling the dynamics of the vocal-tract, only a constraint of smoothness in time will be imposed. The dynamic case can be derived as an extension of the static case. At the end of Chapter 3, it was seen that, for a given time interval, the trajectories of the components of vector $\boldsymbol{\gamma}$ (or $\mathbf{a}$, in the case of representation by Fourier cosine series) can be approximated by a linear combination of eigenvectors whose components follow the approximate shape of cosine functions, as expressed by Eq. (3.42), rewritten below as
$$\Gamma_i = \Omega_i V_\Gamma^t + E_i, \tag{4.32}$$
where $E_i$ is the approximation error matrix, $\Gamma_i$ is a matrix whose columns are a sequence of $p$ articulatory vectors $(\boldsymbol{\gamma}_i, \ldots, \boldsymbol{\gamma}_{i+p-1})$, and the $q$ columns of $V_\Gamma$ are eigenvectors whose sum, weighted by the coefficients contained in $\Omega_i$, approximates $\Gamma_i$.
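Now, using Eq. (4.30), and dropping the time index $i$ for convenience, it is possible to write
$$\mathcal{M} \mathcal{G} = \mathcal{H}, \tag{4.33}$$
where
$$\mathcal{M} = \begin{bmatrix}
\Phi(\boldsymbol{\gamma}_i) & 0 & \cdots & 0 \\
0 & \Phi(\boldsymbol{\gamma}_{i+1}) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \Phi(\boldsymbol{\gamma}_{i+p-1})
\end{bmatrix} \tag{4.34}$$
is a "matrix of matrices" containing the locally linear relation between a sequence of articulatory variations and their acoustic and cost counterparts;$^{5}$
$$\mathcal{G} = \begin{bmatrix} \Delta\boldsymbol{\gamma}_i \\ \Delta\boldsymbol{\gamma}_{i+1} \\ \vdots \\ \Delta\boldsymbol{\gamma}_{i+p-1} \end{bmatrix} \tag{4.35}$$
contains a sequence of articulatory variations $\Delta\boldsymbol{\gamma}_i$ vertically arranged as a column vector; and
$$\mathcal{H} = \begin{bmatrix} \Delta\tilde{\mathbf{h}}_i \\ \Delta\tilde{\mathbf{h}}_{i+1} \\ \vdots \\ \Delta\tilde{\mathbf{h}}_{i+p-1} \end{bmatrix} \tag{4.36}$$
contains a sequence of vectors $\Delta\tilde{\mathbf{h}}_i$, vertically arranged as a column vector, composed of the acoustic and positional cost constraints associated with the articulatory variation vectors contained in $\mathcal{G}$.

5 Compare with Eq. (4.30) and observe that the dependence of $\Phi$ on $\boldsymbol{\gamma}$ is made explicit here.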
The relation given by Eq. (4.32) can be applied to Eq. (4.33), yielding
$$\mathcal{M} \mathcal{V} X = \mathcal{H} + E, \tag{4.37}$$
where
$$\mathcal{V} = \begin{bmatrix}
v_{11} \cdots v_{1q} & 0 \cdots 0 & \cdots & 0 \cdots 0 \\
0 \cdots 0 & v_{11} \cdots v_{1q} & \cdots & 0 \cdots 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 \cdots 0 & 0 \cdots 0 & \cdots & v_{11} \cdots v_{1q} \\
v_{21} \cdots v_{2q} & 0 \cdots 0 & \cdots & 0 \cdots 0 \\
0 \cdots 0 & v_{21} \cdots v_{2q} & \cdots & 0 \cdots 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 \cdots 0 & 0 \cdots 0 & \cdots & v_{21} \cdots v_{2q} \\
\vdots & \vdots & & \vdots \\
v_{p1} \cdots v_{pq} & 0 \cdots 0 & \cdots & 0 \cdots 0 \\
0 \cdots 0 & v_{p1} \cdots v_{pq} & \cdots & 0 \cdots 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 \cdots 0 & 0 \cdots 0 & \cdots & v_{p1} \cdots v_{pq}
\end{bmatrix} \tag{4.38}$$
is an $Np \times Nq$ matrix whose nonzero elements $v_{ij}$ are the entries of $V_\Gamma$ (each row of $V_\Gamma$ is repeated $N$ times in the matrix shown above);
$$X = [\Delta\omega_{11,i}, \Delta\omega_{12,i}, \ldots, \Delta\omega_{1q,i}, \; \Delta\omega_{21,i}, \Delta\omega_{22,i}, \ldots, \Delta\omega_{2q,i}, \; \ldots, \; \Delta\omega_{N1,i}, \Delta\omega_{N2,i}, \ldots, \Delta\omega_{Nq,i}]^t \tag{4.39}$$
contains the entries of a variation $\Delta\Omega_i$ rearranged, row by row, in a column vector; and, finally, $E$ is an $Np \times 1$ column vector containing the approximation error.
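The block structure of Eq. (4.38) is easier to see when it is built programmatically. The sketch below uses random stand-ins for $V_\Gamma$ and for the coefficient matrix (assumptions for the example) and checks that multiplying the big matrix by the stacked coefficients gives the same numbers as the matrix product of Eq. (4.32), arranged frame after frame.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, q = 5, 6, 3                      # articulatory components, frames, eigenvectors

V = rng.standard_normal((p, q))        # stand-in for the p x q eigenvector matrix
Omega = rng.standard_normal((N, q))    # stand-in for the N x q coefficient matrix

# Eq. (4.38): for each frame k, row k of V is repeated N times on a block diagonal.
V_big = np.vstack([np.kron(np.eye(N), V[k:k + 1, :]) for k in range(p)])

# Eq. (4.39): entries of Omega stacked row by row.
X = Omega.reshape(-1)

# Stacked articulatory vectors, one frame after another (compare Eq. (4.35)).
G = V_big @ X

# Reference: Eq. (4.32) without the error term; columns of Gamma are the frames.
Gamma = Omega @ V.T
G_reference = Gamma.T.reshape(-1)
```

Each group of $N$ consecutive entries of `G` is one column of `Gamma`, i.e. one articulatory vector of the sequence.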
Eq. (4.37) defines an overdetermined system which can be solved by minimizing a weighted version of the squared error
$$\xi(X) = E^t H_\xi E \tag{4.40}$$
$$\phantom{\xi(X)} = [\mathcal{M} \mathcal{V} X - \mathcal{H}]^t H_\xi [\mathcal{M} \mathcal{V} X - \mathcal{H}], \tag{4.41}$$
where $H_\xi$ is an $Np \times Np$ positive definite (Horn, 1985, p. 250) matrix which can be used to give different weights to acoustic and morphological constraints, and to different subintervals of the speech interval under analysis. It may be interesting when dealing with more complex speech intervals, or when studying the trade-off between effort and acoustic accuracy in speech (Lindblom, 1990). However, at the present stage of this study, such possibilities are not yet being explored, and $H_\xi$ is taken as an identity matrix.
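Minimization of $\xi$ with respect to $X$ is carried out as follows:
$$\begin{aligned}
\frac{d\xi}{dX} = 0
&\implies [\mathcal{M}\mathcal{V}]^t H_\xi [\mathcal{M}\mathcal{V} X - \mathcal{H}] + [\mathcal{M}\mathcal{V}]^t H_\xi^t [\mathcal{M}\mathcal{V} X - \mathcal{H}] = 0 \\
&\implies [\mathcal{M}\mathcal{V}]^t (H_\xi + H_\xi^t) [\mathcal{M}\mathcal{V} X - \mathcal{H}] = 0 \\
&\implies [\mathcal{M}\mathcal{V}]^t (H_\xi + H_\xi^t) [\mathcal{M}\mathcal{V}] X = [\mathcal{M}\mathcal{V}]^t (H_\xi + H_\xi^t) \mathcal{H} \\
&\implies X = \left\{ [\mathcal{M}\mathcal{V}]^t (H_\xi + H_\xi^t) [\mathcal{M}\mathcal{V}] \right\}^{-1} [\mathcal{M}\mathcal{V}]^t (H_\xi + H_\xi^t) \mathcal{H}
\end{aligned} \tag{4.42}$$
and, for the particular case of $H_\xi$ being an identity matrix,
$$X = \left\{ [\mathcal{M}\mathcal{V}]^t [\mathcal{M}\mathcal{V}] \right\}^{-1} [\mathcal{M}\mathcal{V}]^t \mathcal{H}. \tag{4.43}$$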
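Equations (4.42) and (4.43) are the normal equations of a weighted least-squares problem. The following sketch uses random stand-ins (assumptions for the example): `A` plays the role of the product $\mathcal{M}\mathcal{V}$, `b` the role of $\mathcal{H}$, and `W` the role of the weight matrix. For the unweighted case the result is compared with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(5)
n_eq, n_unk = 30, 15                         # stand-ins for Np equations and Nq unknowns

A = rng.standard_normal((n_eq, n_unk))       # plays the role of the product M V
b = rng.standard_normal(n_eq)                # plays the role of the stacked constraint vector
W = np.diag(rng.uniform(0.5, 2.0, n_eq))     # assumed positive definite weights

def solve_weighted(A, b, W):
    """Eq. (4.42): minimizer of (A X - b)^t W (A X - b)."""
    S = W + W.T
    return np.linalg.solve(A.T @ S @ A, A.T @ S @ b)

X_weighted = solve_weighted(A, b, W)
X_identity = solve_weighted(A, b, np.eye(n_eq))       # Eq. (4.43)
X_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]        # reference solution

# At the minimizer the gradient of the weighted squared error vanishes.
gradient = A.T @ (W + W.T) @ (A @ X_weighted - b)
```

Solving the normal equations with `np.linalg.solve` mirrors the formulas in the text; for ill-conditioned problems a solver such as `np.linalg.lstsq` is usually the more robust choice.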

Iterative solution
As in the case described in Section 4.1, the equality of Eq. (4.43) holds only for sufficiently small variations. For larger variations, the problem is solved by the same kind of Newton-Raphson procedure used in the previous section. The acoustic components of $\mathcal{H}$ are initialized by the trajectory given by the difference between a given sequence of acoustic vectors$^{6}$ and the acoustic vector determined by the articulatory neutral position. The minimum cost components of $\mathcal{H}$ are initially set to zero, and adapted as the articulatory trajectory vector $X$ changes during the iterative procedure. For a given $X$, $\Delta\Omega_i$ is obtained by simply rearranging the entries of $X$ as (see Eq. (4.39))
$$\Delta\Omega_i = \begin{bmatrix}
\Delta\omega_{11,i} & \cdots & \Delta\omega_{1q,i} \\
\vdots & \ddots & \vdots \\
\Delta\omega_{N1,i} & \cdots & \Delta\omega_{Nq,i}
\end{bmatrix}, \tag{4.44}$$
and the corresponding articulatory trajectory is approximated by (see Eq. (3.42))
$$\Gamma_i \simeq \Omega_i V_\Gamma^t + \Gamma_0, \tag{4.45}$$
where $\Gamma_0$ is the "neutral trajectory" determined by the vocal-tract sustaining the neutral position along the analyzed interval. Finally, the log-area vector trajectory is obtained from (see Eq. (3.29))
$$Y \simeq T_y \Gamma + \mathbf{y}_0. \tag{4.46}$$

6 The acoustic vectors are determined from formant log-frequency vectors using Eq. (3.33).
Comment
Now it is possible to estimate a plausible articulatory trajectory from a given acoustic trajectory. Conceptually, the method looks consistent. However, it must not be forgotten that the vocal-tract articulation effort was arbitrarily represented by a quadratic cost function, which is convenient from the mathematical point of view, but has no physiological basis. From the results presented in the next chapter, it will be possible to evaluate the performance of the method under the limitations imposed by this admittedly artificial effort measure.
Chapter 5
Results and Discussion
"Computers are useless,
they only give answers."
Pablo Ruiz y Picasso (1881-1973)

Some results obtained with the method described in the previous chapter are given here. The isolated frame procedure is analyzed in the first section. The second section analyzes the case of trajectories. Not surprisingly, the results obtained are not always perfect, since the quadratic cost function used to measure the effort of each vocal-tract position (or, in a broader sense, trajectory) was chosen more because it allows a simple mathematical minimization procedure than for physiological characteristics of the vocal-tract. Nevertheless, the results are enough to show that using a simple relation between acoustic and articulatory parameters it is possible to represent acoustic constraints in the articulatory space, and combine them directly with minimum effort and continuity constraints.

5.1 Isolated Frames


The procedure for inversion of the articulatory-to-acoustic mapping for isolated frames described in the previous chapter was applied to the oral vowel frames contained in the corpus. All 258 analyzed frames are shown in Appendix B. Some selected frames are used here to interpret the results.
The four frames shown in Fig. 5.1 show good results obtained for different vowels. In general, the good agreement between original areas (derived from midsagittal distances) and estimated areas (obtained from the formant frequencies determined by the original areas) prevails in most of the frames analyzed. In spite of that, large discrepancies also exist, and it is exactly the understanding of the different types of discrepancies that may allow future improvements to the system.
Four types of error are shown in Fig. 5.2. From left to right, the first column shows the case of excessively open lips. A possible explanation for that comes from the fact that the inversion procedure is carried out in the log-area domain, which makes large areas less sensitive to errors than small areas. From the acoustic point of view it makes sense, since formant frequency variations also depend on the log-area rather than on the area function itself. However, from the articulatory point of view, considerably large, sometimes unacceptable, errors can occur.
The second column shows the case of an excessively large oral cavity behind a very narrow constriction at the lips. From the acoustic point of view, quite large variations on wide areas behind narrow constrictions have small effects on formant frequencies. From the articulatory point of view, the quadratic cost function used to evaluate vocal-tract effort to reach a position seems to be very "mild" with respect to large variations in the oral cavity. A possible explanation for that comes from the axial symmetry of the quadratic cost function: since complete closures are accomplished in the oral cavity with little effort, and since a complete closure is associated with a "minus infinite log-area," very large oral areas become also associated with little effort. The clipping procedure to avoid closures used in Chapter 2 during the construction of the log-area corpus apparently was not enough to avoid this discrepancy. A possible solution for this problem would be the use of an asymmetric cost function, but this would make the cost minimization procedure more complex. Another possibility is to use continuity constraints in time. This is the subject of the next section. Although an exhaustive analysis of trajectory estimation has not been carried out yet, the cases analyzed did not present this kind of discrepancy.


The third column of Fig. 5.2 shows a severely underestimated vocal-tract length. A clear explanation for this error has not been found yet, but this is one more case where continuity constraints in time could help.
The right column of Fig. 5.2 shows a case where an underestimation of the vocal-
tract length is compensated by a partial lip closure. This kind of compensation is not
unlikely to happen in real speech. This error shows that, even under morphological
constraints represented correctly, more than one plausible shape of the vocal-tract
can produce the same set of formants.
Finally, we call attention to the fact that, even when the inversion procedure failed in estimating the correct area function, the transfer functions associated with original and estimated areas match fairly well. This fact indicates that most of the articulatory errors observed have small acoustic effects.

5.2 Trajectories
Adding the continuity constraints explained in Section 4.2 to the method of combination of acoustic and morphological information described in Section 4.1, it was possible to estimate sequences of area functions from the corresponding first three formant trajectories.
Some characteristics of the inversion procedure are illustrated in the following way. In the example given in Fig. 5.3, the sequence of area functions shown in the top panel was used to generate the formant trajectories shown in the bottom panel. These trajectories were then used to recover the original sequence of areas, under minimum effort and continuity constraints. The result is shown in the top panel of Fig. 5.4. The search for the best sequence of areas was performed in the articulatory trajectory "$\Omega$" space. Note, however, that the sequence of areas shown in Fig. 5.4 is close, but not identical, to that shown in Fig. 5.3. A possible reason for this is associated with the fact that the mathematical cost function used does not perfectly reflect the articulation effort determined by the human physiology. Another reason is that the parametrization procedure allows only an approximate reconstruction of the original sequence of areas.
[Figure 5.1 (plots not reproduced). Four columns, one per frame: PB0311, PB2854, PB1560, PB1754. Panels in each column, from top to bottom: Speech Signal (amplitude vs. time, 0 to 20 ms); Power Spectrum (dB vs. frequency, 0 to 5 kHz); Area Function (area in cm² vs. distance from glottis in cm); Midsagittal Distances (cm vs. distance from glottis in cm); Vocal-Tract Profile (cm vs. cm).]

Figure 5.1: Results obtained with the inversion technique for isolated frames. In each column, the central panel shows original (thin line) and estimated area (thick line). The estimated area is obtained from the formant frequencies determined by the original area. Vocal-tract profile, midsagittal distances, transfer functions and speech signal are also shown for reference purposes. From left to right the columns correspond to the neutral, and French /a/, /i/, and /u/ vowels.

[Figure 5.2 (plots not reproduced). Four columns, one per frame: PB0216, PB0230, PB0334, PB1518. Panels in each column, from top to bottom: Speech Signal (amplitude vs. time, 0 to 20 ms); Power Spectrum (dB vs. frequency, 0 to 5 kHz); Area Function (area in cm² vs. distance from glottis in cm); Midsagittal Distances (cm vs. distance from glottis in cm); Vocal-Tract Profile (cm vs. cm).]

Figure 5.2: Problems with the inversion procedure. The columns show the following cases. Left: French /a/ with excessively open lips. Center-left: French /u/ with excessively large front cavity. Center-right: French /i/ with excessively short length. Right: French /e/ with excessively closed lips compensating underestimated length. As in Fig. 5.1, in the central panel of each column, the thin line is the original area and the thick line is the area estimated from the formant frequencies determined by the original area.

For comparison purposes, it is interesting to see the results when the same problem is solved using a truncated Fourier series to represent the log-area function. We start with the analysis carried out by Mermelstein (1967), who parametrized the vocal-tract log-area function by the first six coefficients of its Fourier cosine series expansion. It was verified that, when the even coefficients are all equal to zero, as already mentioned, there exists a one-to-one relationship between the first three formant frequencies and the three odd Fourier coefficients. Using this property, an iterative procedure was implemented to find the unique set of odd Fourier coefficients associated with a given set of formant frequencies, when all even Fourier coefficients are equal to zero. This procedure was used to obtain the sequence of areas shown in Fig. 5.5 from the formant trajectories shown in Fig. 5.3. Note that the result is substantially different from the original sequence of areas (top panel of Fig. 5.3). This confirms the fact that setting all even Fourier coefficients to zero is an artificial constraint that does not reflect the geometrical constraints determined by the vocal-tract morphology. When the mathematical framework described in the previous chapter to incorporate such morphological constraints is used with the Fourier representation described in Chapter 3, the sequence of areas shown in Fig. 5.6 is obtained. It can be seen that it resembles the original sequence of areas shown in Fig. 5.3. However, abrupt variations, inherent in some regions of the vocal-tract, cannot be well approximated, due to the smooth character of the cosine functions, which form the basis of the Fourier cosine series representation. This is in contrast with the eigenvectors used in the principal component representation, which allow a good representation of the vocal-tract structure. When the principal component representation is used in place of the Fourier representation, the result obtained is the sequence of areas shown in Fig. 5.4.
As a final observation, the similarity of the formant frequency trajectories associated with the sequences of areas shown in Figures 5.3, 5.4, 5.5, and 5.6 shows that, even under continuity constraints, substantially different sequences of area functions can generate basically the same formant trajectories.
[Figure 5.3 (plots not reproduced). Top panel: "Original Area", area (cm²) as a function of position (0 to 15 cm) and time (0 to 0.2 s). Bottom panel: "Formant Frequency Trajectories", frequency (kHz) vs. time (0 to 0.2 s).]

Figure 5.3: Top: Sequence of area functions, taken from the corpus, corresponding to the diphthong /ui/, uttered in the French sentence "Luis pense à ça." Bottom: Formant frequency trajectories corresponding to the sequences of areas shown in the top panel.
[Figure 5.4 (plots not reproduced). Top panel: "Principal Components", area (cm²) as a function of position (0 to 15 cm) and time (0 to 0.2 s). Bottom panel: "Formant Frequency Trajectories", frequency (kHz) vs. time (0 to 0.2 s). Maximum differences annotated in the plot: F1: 7.8%, F2: 6.4%, F3: 2.6%.]

Figure 5.4: Top: Sequence of areas estimated from the formant trajectories shown in the bottom panel of Fig. 5.3, under continuity and minimum effort constraints. Bottom: The solid lines show the formant frequency trajectories corresponding to the sequences of areas shown in the top panel. The dashed lines reproduce the original formant trajectories shown in the bottom panel of Fig. 5.3.
[Figure 5.5 (plots not reproduced). Top panel: "Fourier (Odd Terms)", area (cm²) as a function of position (0 to 15 cm) and time (0 to 0.2 s). Bottom panel: "Formant Frequency Trajectories", frequency (kHz) vs. time (0 to 0.2 s). Maximum differences annotated in the plot: F1: 3.2%, F2: 6.5%, F3: 4.7%.]

Figure 5.5: Top: Sequence of area functions estimated from the formant trajectories shown in Fig. 5.3 under the following constraint: The areas are represented by the first six components of their Fourier cosine series expansion with the even coefficients set to zero. Bottom: The solid lines show the formant frequency trajectories corresponding to the sequences of areas shown in the top panel. The dashed lines reproduce the original formant trajectories shown in the bottom panel of Fig. 5.3.
[Figure 5.6 (plots not reproduced). Top panel: "Fourier (All Terms)", area (cm²) as a function of position (0 to 15 cm) and time (0 to 0.2 s). Bottom panel: "Formant Frequency Trajectories", frequency (kHz) vs. time (0 to 0.2 s). Maximum differences annotated in the plot: F1: 2.1%, F2: 5.4%, F3: 3.2%.]

Figure 5.6: Top: Sequence of area functions estimated from the formant trajectories shown in Fig. 5.3 under the following constraint: The areas are represented by the first nine components of their Fourier cosine series expansion determined under morphological and continuity constraints. Bottom: The solid lines show the formant frequency trajectories corresponding to the sequences of areas shown in the top panel. The dashed lines reproduce the original formant trajectories shown in the bottom panel of Fig. 5.3.

5.3 Quantitative Analysis


The qualitative analysis done in Sections 5.1 and 5.2 was important to understand
the limitations of the method as well as the reasons for the discrepancies observed.
In this section, a quantitative analysis based on the 258 frames of the corpus that
correspond to oral vowels is carried out. The objective is to give a measure of the
performance of the method and to understand its global behavior.
The first step in this analysis is to verify the influence of the parametrization
procedure on the original cross-sectional areas. It is illustrated in Fig. 5.7a, where
all cross-sectional areas in logarithmic scale¹ recovered from the parametric
representation as vectors are plotted against their original counterparts. The correlation
coefficient² (Papoulis, 1991, p. 152) of 0.965 indicates that the parametric
representation is good, but the error implied by it is not negligible. Looking at Table 5.1, the
mean relative error of 0.9 % indicates that, globally, the parametrization procedure
does not cause any significant bias. The standard deviation of the relative error of
21 % is indeed significant, but still acceptable. In particular, note that the errors due
to parametrization caused only small deviations in the acoustic space (see Fig. 5.7e
and Table 5.3).
Next, original areas and areas estimated from isolated frames are compared (Fig.
5.7b and Table 5.1). A reasonably good correlation coefficient of 0.828 is obtained.
Note, however, the minimal error in the acoustic space (Fig. 5.7f and Table 5.3). The
standard deviation of the relative error, 56 %, is high, but it is bounded from below by
the 21 % standard deviation of the parametrization error. Also, the very good matching
observed in the acoustic space affects the accuracy of the areas estimated in the
articulatory space.
When sequences of log-area vectors are estimated instead of isolated vectors, the
correlation coefficient increases only marginally from 0.828 to 0.832 (Fig. 5.7c and
Table 5.1). Nevertheless, observing the scattering shown in Fig. 5.7d and the standard
deviation of the relative error of 17 % listed in Table 5.1, it is seen that the areas
estimated are significantly different. The estimation based on sequences of vectors
yields, naturally, smoother trajectories of log-area vectors. The price paid for that
is a small degradation in the matching observed in the acoustic space (Fig. 5.7g and
Table 5.3).

1 There are 258 log-area vectors, each of them containing 32 log-areas. So, each scattering in
the top row of Fig. 5.7 contains 258 × 32 = 8256 points. In the bottom row, since there are three
formants per vector, each scattering contains 774 points.
2 The correlation coefficient was computed in logarithmic scale as
$E[\log A_1 \log A_2] / \sqrt{E[(\log A_1)^2]\, E[(\log A_2)^2]}$.
Finally, the results obtained for length estimation (Table 5.2) show correlation
coefficients considerably lower than those obtained for cross-sectional areas. This
fact indicates the need for a more appropriate method to handle length information.
The small values observed for the standard deviation of the relative error are due to
the fact that length variations are small compared with total vocal-tract length. This
is in contrast with cross-sectional areas, which vary from values very close to zero up
to several square centimeters.
Summarizing, the high correlation coefficients observed in the acoustic space
confirm that the acoustic constraint imposed by the formant vectors is respected
during the log-area vector estimation. In the articulatory space, correlation coefficients
around 0.83 indicate that the model works, but still has to be improved.
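
The statistics reported in Tables 5.1 to 5.3 can be illustrated with a minimal sketch. It assumes that the relative error is defined as (estimate - reference) / reference, which is not stated explicitly in this section, and that the second factor in the denominator of the correlation coefficient of footnote 2 refers to log A2. The function names and the synthetic data are hypothetical; they only mimic the 258 × 32 layout of the log-area vectors.

```python
import numpy as np


def log_correlation(a1, a2):
    """Uncentered correlation coefficient computed on log-areas (cf. footnote 2)."""
    l1, l2 = np.log(a1), np.log(a2)
    return np.mean(l1 * l2) / np.sqrt(np.mean(l1 ** 2) * np.mean(l2 ** 2))


def relative_error_stats(estimate, reference):
    """Mean and standard deviation, in percent, of the assumed relative error."""
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    rel = (estimate - reference) / reference
    return 100.0 * rel.mean(), 100.0 * rel.std()


# Synthetic stand-in for the corpus: 258 frames with 32 areas each (not real data).
rng = np.random.default_rng(0)
reference = np.exp(rng.normal(0.5, 1.0, size=(258, 32)))
estimate = reference * np.exp(rng.normal(0.0, 0.2, size=reference.shape))

mean_diff, std_dev = relative_error_stats(estimate, reference)
corr = log_correlation(estimate, reference)
print(f"mean diff. {mean_diff:.1f} %, std. dev. {std_dev:.1f} %, corr. coef. {corr:.3f}")
```

Because the ratio in footnote 2 is unchanged when both logarithms are rescaled by the same factor, the choice of logarithm base does not affect the coefficient.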

Table 5.1: Numerical Results: Areas

                                    Mean Diff.   Std. Dev.   Corr. Coef.
Parametrized vs. Original              0.9 %        21 %        0.965
Isolated Frames vs. Original           4.1 %        56 %        0.828
Trajectories vs. Original              3.9 %        54 %        0.832
Isolated Frames vs. Trajectories      -0.2 %        17 %        0.979
Trajectories vs. Parametrized          3.0 %        45 %        0.872

Table 5.2: Numerical Results: Length

                                    Mean Diff.   Std. Dev.   Corr. Coef.
Parametrized vs. Original              0.44 %       0.7 %       0.987
Isolated Frames vs. Original          -0.19 %       4.2 %       0.607
Trajectories vs. Original              0.19 %       3.9 %       0.635
Isolated Frames vs. Trajectories      -0.38 %       1.9 %       0.928
Trajectories vs. Parametrized         -0.25 %       4.0 %       0.603

Table 5.3: Numerical Results: Formants

                                    Mean Diff.   Std. Dev.   Corr. Coef.
Parametrized vs. Original             -0.40 %       1.5 %       0.99925
Isolated Frames vs. Original          -0.04 %       0.2 %       0.99999
Trajectories vs. Original             -0.10 %       1.3 %       0.99936
Isolated Frames vs. Trajectories       0.06 %       1.3 %       0.99937
Trajectories vs. Parametrized          0.30 %       1.8 %       0.99884

[Figure 5.7 plot: log-log scatter plots. Top row, Area (cm2) vs. Area (cm2): (a) Param. vs. Orig., 0.965; (b) Isol. Frm. vs. Orig., 0.828; (c) Traject. vs. Orig., 0.832; (d) Isol. Frm. vs. Traject., 0.979. Bottom row, Freq. (kHz) vs. Freq. (kHz): (e) 0.99925; (f) 0.99999; (g) 0.99936; (h) 0.99937.]

Figure 5.7: (a) Scattering of the cross-sectional areas obtained from the parametric
principal component representation of the original areas plotted against their original
counterparts. The scattering of the formant frequencies derived from the areas is shown
in (e). (b) Cross-sectional areas estimated from formant vectors in the case of isolated
frames plotted against original areas. The formant frequencies derived from the areas are
plotted in (f). (c) Cross-sectional areas estimated from formant vector trajectories plotted
against original areas. The formant frequencies derived from the areas are plotted in (g).
(d) Cross-sectional areas estimated from isolated frames plotted against areas estimated
from formant vector trajectories. The formant frequencies derived from the areas are
plotted in (h). The correlation coefficients are given in the top right corner. The 258 oral
vowel frames available in the corpus were used to generate the scatterings.
Chapter 6
Conclusion
"Words are words."
William Shakespeare (1564–1616)
Othello, Act I, Scene III

In this study, a method to combine different pieces of information in a restricted
case of the speech production inverse problem, namely the formant-to-area
determination problem, was presented. The initial formulation is based on a Fourier analysis
of the vocal-tract log-area function, already described by Mermelstein (1967). The
novelty is that vocal-tract morphological constraints are invoked to cope with the
underdetermined problem of obtaining a complete set of log-area Fourier coefficients
from formant frequencies. After that, the Fourier representation is substituted by an
optimal principal component representation of the log-area function which allows a
better characterization of the vocal-tract. As a final point, the analysis is generalized
from isolated frames to trajectories of log-area parameters. This allows a natural
implementation of continuity constraints in the articulatory domain.

The implemented system uses a Newton-Raphson iterative procedure to solve the
non-linear system that arises in the framework formulation. The solution took on
average four iterations to converge, usually, but not always, to a position close to the
right solution. It is, in principle, more efficient than analysis-by-synthesis techniques
(Shirai and Kobayashi, 1986; Schroeter and Sondhi, 1991) that require a much larger
number of iterations. Also, it gives a better insight into the problem than the neural
network (Shirai, 1993) and the genetic algorithm (McGowan, 1994) approaches.
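
As a reminder of how such an iteration proceeds, the following is a generic Newton-Raphson sketch applied to a small toy system. It is not the non-linear system of the framework developed in this study: the functions `residual` and `jacobian`, the starting point and the tolerances are hypothetical choices made only for illustration.

```python
import numpy as np


def newton_raphson(residual, jacobian, x0, tol=1e-10, max_iter=20):
    """Solve residual(x) = 0 by repeatedly solving the linearized system."""
    x = np.asarray(x0, dtype=float)
    for iteration in range(1, max_iter + 1):
        # Newton step: solve J(x) * step = -r(x).
        step = np.linalg.solve(jacobian(x), -residual(x))
        x = x + step
        if np.linalg.norm(step) < tol:
            return x, iteration
    raise RuntimeError("Newton-Raphson did not converge")


# Toy system (illustrative only): x^2 + y^2 = 4 and x * y = 1.
def residual(v):
    x, y = v
    return np.array([x ** 2 + y ** 2 - 4.0, x * y - 1.0])


def jacobian(v):
    x, y = v
    return np.array([[2.0 * x, 2.0 * y], [y, x]])


solution, iterations = newton_raphson(residual, jacobian, x0=[2.0, 0.5])
print(solution, iterations)
```

As in the text, convergence is only local: a starting point far from a solution, or one where the Jacobian is close to singular, may lead to a different solution or to no convergence at all.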
The main weak point of the method developed in this study is the limited flexibility
of the cost function chosen to quantify the vocal-tract effort during speech production.
The quadratic form used has the merit of allowing a simple minimization procedure,
but does not reflect well the vocal-tract effort for positions far from the neutral
articulatory position. In spite of that, the method worked satisfactorily for most of
the analyzed cases.
In the experimental results, there was always a very good match between reference
and estimated transfer functions (at least up to the third formant region), whereas
the match between reference and estimated areas was not perfect. It is therefore
possible to conclude that the regions that did not match well were mainly the regions
that have little influence on the vocal-tract acoustic response (up to the third formant
region).
One important point observed during the analysis is that, when appropriately
represented, the mapping between articulatory and acoustic properties of the human
vocal-tract is not complex, having a dominant linear component.
Another interesting conclusion concerns the obtained transfer functions. They
were derived only from the formant frequencies, making use of prior information
about morphological and continuity constraints, and a good spectral matching was
obtained (up to the third formant region). It is therefore possible to say that, if
morphological information is available, the vocal-tract transfer function can be
derived from the formant frequencies (cf. Fant, 1956). It remains to be shown whether
the human being makes use of such redundancy and, if so, in what way.
Appendix A
Numeric Information
Most of the numeric information used in the implementation of the vocal-tract
parametric model described in this study was not included in the main text. Instead, for
practical purposes, it is given in this appendix, and can be used by the interested
reader to implement, test and analyze the model proposed.
In order to do this, some observations are important. The first one is that the
tract length is expressed in normalized units, which can be converted into centimetres
as follows:

1 length unit = 0.534 cm.

The second observation is about the procedure used to "fill" the articulatory space in
Section 3.2.2: first, a sufficiently high number of points is uniformly generated in the
hyperrectangle defined by min and max. After that, the corresponding log-area
vectors are calculated, and those that exceed the limits defined by ymin and ymax
are discarded, since they probably correspond either to unrealistic area functions or
to areas with constrictions that are too narrow. The final observation is about the
procedure used to estimate the formants associated with a given area function: they
can be determined using the wave propagation model described in Section 2.2.3.
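
The length conversion and the "fill" procedure described above can be sketched as follows. The sketch assumes a linear model in which a log-area vector is obtained as a mean vector plus a basis matrix times the parameter vector; this mapping, the function names and the small toy matrices are hypothetical and are not the values listed in this appendix.

```python
import numpy as np

LENGTH_UNIT_CM = 0.534  # conversion factor stated in this appendix


def length_in_cm(length_units):
    """Convert a tract length from normalized units to centimetres."""
    return LENGTH_UNIT_CM * length_units


def fill_space(y_mean, basis, p_min, p_max, y_min, y_max, n_points, seed=0):
    """Sample parameters uniformly in a hyperrectangle and discard the points
    whose log-area vectors leave the limits [y_min, y_max]."""
    rng = np.random.default_rng(seed)
    params = rng.uniform(p_min, p_max, size=(n_points, len(p_min)))
    # Assumed linear model: log-area vector = mean + basis @ parameters.
    log_areas = y_mean + params @ basis.T
    keep = np.all((log_areas >= y_min) & (log_areas <= y_max), axis=1)
    return params[keep], log_areas[keep]


# Toy dimensions (4 log-areas, 2 parameters), for illustration only.
y_mean = np.zeros(4)
basis = np.array([[0.5, 0.1], [0.2, -0.4], [-0.3, 0.3], [0.1, 0.6]])
p_min = np.array([-2.0, -1.0])
p_max = np.array([2.0, 1.0])
y_min = np.full(4, -0.6)
y_max = np.full(4, 0.6)

kept_params, kept_log_areas = fill_space(
    y_mean, basis, p_min, p_max, y_min, y_max, n_points=1000
)
print(len(kept_params), "of 1000 points kept")
print(length_in_cm(28.19), "cm")
```

Discarding points after sampling keeps the generation step trivial, at the cost of wasting the samples that fall outside the admissible log-area limits.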

U y =
  0.006  0.019  0.004  0.004  0.052
  0.010  0.035  0.026  0.010  0.130
  0.017  0.045  0.036  0.016  0.177
  0.030  0.026  0.012  0.110  0.142
  0.039  0.050  0.002  0.030  0.000
  0.057  0.087  0.023  0.013  0.136
  0.083  0.082  0.029  0.004  0.142
  0.101  0.071  0.042  0.034  0.161
  0.111  0.054  0.053  0.059  0.159
  0.113  0.033  0.059  0.079  0.137
  0.105  0.014  0.060  0.088  0.112
  0.091  0.001  0.058  0.090  0.091
  0.074  0.009  0.052  0.084  0.080
  0.072  0.017  0.048  0.106  0.102
  0.078  0.043  0.056  0.154  0.159
  0.076  0.059  0.055  0.170  0.170
  0.056  0.136  0.063  0.208  0.177
  0.011  0.226  0.092  0.233  0.199
  0.043  0.220  0.128  0.199  0.232
  0.102  0.205  0.135  0.131  0.227
  0.184  0.216  0.136  0.047  0.217
  0.264  0.205  0.130  0.048  0.192
  0.307  0.169  0.099  0.120  0.129
  0.324  0.125  0.062  0.157  0.068
  0.334  0.081  0.028  0.156  0.019
  0.340  0.026  0.010  0.131  0.020
  0.348  0.077  0.052  0.062  0.048
  0.331  0.208  0.106  0.112  0.020
  0.288  0.296  0.165  0.355  0.006
  0.202  0.262  0.247  0.537  0.002
  0.005  0.157  0.514  0.165  0.111
  0.099  0.304  0.712  0.259  0.269

  0.006  0.584  0.002  0.353  0.599


  y =      [ymin   ymax] =
   0.82     0.28    1.22
   0.46     0.12    1.14
   0.02     0.82    1.17
   0.39     0.76    1.13
   0.81     0.36    1.47
   1.00     0.19    1.87
   1.19     0.20    1.93
   1.13     0.12    1.80
   1.04     0.26    1.67
   0.97     0.49    1.58
   0.98     0.55    1.54
   1.07     0.37    1.56
   1.21     0.05    1.64
   1.35     0.21    1.77
   1.35     0.49    1.91
   1.25     0.14    1.93
   1.08     1.47    1.90
   0.62     2.74    1.68
   0.31     3.00    1.43
   0.39     3.00    1.44
   0.43     3.00    1.65
   0.36     3.00    1.84
   0.34     3.00    1.92
   0.32     3.00    1.95
   0.27     3.00    2.01
   0.18     3.00    2.04
   0.04     3.00    2.09
   0.02     3.00    2.15
   0.11     3.00    2.17
   0.19     3.00    2.11
   0.22     3.00    1.77
   0.11     3.00    1.86

  28.19    25.92   33.40



  =         T =
  0.334     0.053  0.655  0.356  0.743  0.701
  1.300     0.348  0.376  0.541  0.121  0.624
  0.027     0.225  0.006  0.440  0.859  1.001
  0.572     0.143  0.195  0.747  0.542  0.018
  0.407     0.519  0.142  0.248  0.638  0.469

  [ min   max] =     U =
  6.120  8.763       0.164  0.458  0.185  0.854  0.000
  7.643  3.452       0.813  0.030  0.264  0.083  0.511
  4.146  4.535       0.378  0.046  0.429  0.045  0.818
  2.678  2.194       0.410  0.012  0.836  0.253  0.263
  2.400  2.279       0.032  0.887  0.115  0.444  0.029

  f =      Tfg =                Uhg =
  2.60     15.4   7.0  18.5     0.063  0.970  0.236
  3.24      8.3  23.0  12.7     0.692  0.128  0.711
  3.43      5.2   7.2  35.4     0.719  0.208  0.663

  Tfh =                Sh =
   2.8  10.1  18.3     0.457  0.007  0.068  0.109  0.810
  17.0   1.7   8.1     0.105  0.641  0.399  0.008  0.126
   5.6  22.8  37.2     0.084  0.499  0.140  0.560  0.237

  0y =     Tty =
   0.83    0.011  0.019  0.007  0.043  0.056
   0.45    0.021  0.043  0.005  0.089  0.161
   0.02    0.035  0.065  0.001  0.150  0.191
   0.55    0.027  0.023  0.020  0.025  0.256
   0.90    0.016  0.045  0.008  0.014  0.021
   1.08    0.032  0.092  0.016  0.102  0.153
   1.26    0.016  0.103  0.015  0.095  0.176
   1.17    0.002  0.110  0.009  0.085  0.226
   1.05    0.019  0.108  0.001  0.063  0.247
   0.95    0.034  0.096  0.011  0.030  0.241
   0.94    0.043  0.080  0.018  0.003  0.219
   1.01    0.046  0.063  0.023  0.012  0.195
   1.14    0.043  0.049  0.021  0.013  0.176
   1.25    0.054  0.042  0.013  0.011  0.218
   1.17    0.080  0.036  0.009  0.003  0.324
   1.04    0.092  0.027  0.007  0.004  0.349
   0.75    0.130  0.018  0.023  0.002  0.383
   0.13    0.153  0.069  0.060  0.016  0.424
   0.18    0.103  0.069  0.093  0.057  0.432
   0.05    0.047  0.080  0.110  0.104  0.365
   0.01    0.009  0.112  0.128  0.171  0.271
   0.02    0.075  0.136  0.140  0.228  0.154
   0.05    0.122  0.151  0.127  0.231  0.016
   0.13    0.148  0.156  0.102  0.205  0.087
   0.15    0.162  0.162  0.071  0.151  0.139
   0.13    0.171  0.166  0.030  0.079  0.158
   0.08    0.188  0.160  0.031  0.044  0.117
   0.13    0.175  0.145  0.132  0.224  0.085
   0.20    0.105  0.149  0.240  0.451  0.344
   0.17    0.032  0.184  0.335  0.584  0.491
   0.09    0.300  0.279  0.478  0.183  0.178
   0.18    0.353  0.269  0.604  0.896  0.092

  28.89    0.409  0.370  0.149  0.661  0.401



T y =
  0.024  0.024  0.011  0.019  0.026
  0.047  0.044  0.005  0.042  0.077
  0.070  0.069  0.003  0.070  0.093
  0.002  0.069  0.029  0.017  0.126
  0.017  0.089  0.044  0.000  0.016
  0.056  0.143  0.051  0.040  0.066
  0.023  0.167  0.046  0.036  0.074
  0.009  0.176  0.029  0.032  0.096
  0.042  0.172  0.007  0.024  0.105
  0.073  0.154  0.014  0.011  0.102
  0.090  0.128  0.031  0.000  0.093
  0.094  0.099  0.042  0.005  0.084
  0.086  0.074  0.044  0.005  0.077
  0.096  0.061  0.046  0.003  0.097
  0.129  0.042  0.070  0.006  0.147
  0.144  0.024  0.081  0.008  0.159
  0.203  0.064  0.148  0.014  0.180
  0.240  0.184  0.248  0.036  0.211
  0.161  0.226  0.283  0.063  0.226
  0.066  0.270  0.288  0.089  0.204
  0.031  0.355  0.308  0.127  0.173
  0.145  0.422  0.305  0.158  0.129
  0.228  0.438  0.256  0.156  0.068
  0.278  0.425  0.193  0.137  0.017
  0.312  0.410  0.129  0.104  0.012
  0.346  0.388  0.051  0.062  0.026
  0.409  0.340  0.069  0.009  0.015
  0.447  0.263  0.229  0.112  0.065
  0.396  0.203  0.365  0.236  0.168
  0.209  0.180  0.433  0.312  0.215
  0.228  0.244  0.370  0.039  0.045
  0.300  0.184  0.440  0.357  0.094

  0.765  0.526  0.437  0.265  0.190


Appendix B
Results for Isolated Frames
The inverse problem algorithm for isolated frames was applied to all oral vowel frames
present in the analyzed corpus. The results are shown in the next pages. From the
bottom, each set of panels shows:
• Vocal-tract midsagittal profile extracted from cineradiographic data (Bothorel
et al., 1986) plotted on a semi-polar grid, and the labiogram simultaneously acquired.
• Midsagittal distances computed with the method described in Section 2.1.2.
• The thin line shows the area function estimated from the midsagittal profile
using the model described in Section 2.1.3. The formants determined
by this area function (using the method described in Section 2.2.3) are used to
estimate the area function shown by the thick line using the method described
in Chapter 4.
• Thin black line: vocal-tract transfer function determined from the area function
estimated from the midsagittal profile. Thick black line: vocal-tract transfer
function determined from the area function estimated from formant frequencies.
Gray line: power spectrum envelope estimated from the speech signal.
• Speech signal recorded during the acquisition of the midsagittal profile shown
in the bottom panel.

[Figure panels for frames PB0108, PB0109, PB0110, PB0111, PB0112, PB0121, PB0122, PB0123. For each frame, from top to bottom: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm2) vs. Distance from Glottis (cm); Midsagittal Distances (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]
[Figure panels for frames PB0124, PB0125, PB0130, PB0131, PB0132, PB0133, PB0134, PB0136. For each frame, from top to bottom: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm2) vs. Distance from Glottis (cm); Midsagittal Distances (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]
[Figure panels for frames PB0137, PB0138, PB0139, PB0140, PB0141, PB0142, PB0154, PB0155. For each frame, from top to bottom: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm2) vs. Distance from Glottis (cm); Midsagittal Distances (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]
[Figure panels for frames PB0156, PB0157, PB0158, PB0159, PB0210, PB0211, PB0212, PB0213. For each frame, from top to bottom: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm2) vs. Distance from Glottis (cm); Midsagittal Distances (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]
[Figure panels for frames PB0215, PB0216, PB0217, PB0218, PB0219, PB0222, PB0223, PB0224. For each frame, from top to bottom: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm2) vs. Distance from Glottis (cm); Midsagittal Distances (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]
98 APPENDIX B. RESULTS FOR ISOLATED FRAMES

[Figure: results for frames PB0229, PB0230, PB0231 and PB0232. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0233, PB0234, PB0241 and PB0242. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0243, PB0244, PB0245 and PB0310. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0311, PB0312, PB0313 and PB0314. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0315, PB0331, PB0332 and PB0333. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0334, PB0343, PB0344 and PB0345. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0346, PB0347, PB0348 and PB0349. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0805, PB0806, PB0807 and PB0812. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0813, PB0814, PB0815 and PB0838. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0839, PB0840, PB0841 and PB0842. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0843, PB0848, PB0849 and PB0850. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0851, PB0852, PB0853 and PB0854. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0855, PB0856, PB0916 and PB0917. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0918, PB0919, PB0920 and PB0921. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0922, PB0923, PB0924 and PB0925. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0944, PB0945, PB0946 and PB0947. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0948, PB0949, PB0950 and PB0961. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0962, PB0963, PB0964 and PB0965. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB0966, PB0967, PB0968 and PB1515. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB1516, PB1517, PB1518 and PB1519. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB1526, PB1527, PB1528 and PB1529. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for frames PB1534, PB1535, PB1536 and PB1537. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: frames PB1538, PB1539, PB1540, PB1544. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1545, PB1546, PB1547, PB1548. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]
APPENDIX B. RESULTS FOR ISOLATED FRAMES

[Figure: frames PB1549, PB1550, PB1556, PB1557. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1558, PB1559, PB1560, PB1561. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1562, PB1714, PB1715, PB1716. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1719, PB1720, PB1721, PB1722. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1729, PB1730, PB1731, PB1732. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1733, PB1738, PB1739, PB1740. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1741, PB1742, PB1743, PB1754. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1755, PB1756, PB1757, PB1758. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1820, PB1821, PB1822, PB1823. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1832, PB1833, PB1834, PB1835. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1836, PB1837, PB1845, PB1846. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1847, PB1848, PB1849, PB1850. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1851, PB1852, PB1856, PB1857. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1858, PB1859, PB1860, PB1870. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1871, PB1872, PB1873, PB1874. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB1875, PB1876, PB2413, PB2414. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB2415, PB2418, PB2419, PB2420. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB2421, PB2425, PB2426, PB2427. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB2428, PB2431, PB2432, PB2433. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB2434, PB2435, PB2441, PB2442. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: frames PB2443, PB2444, PB2445, PB2446. Panels per frame: Speech Signal (x: Time, ms); Power Spectrum (dB; x: Frequency, kHz); Area Function (y: Area, cm²; x: Distance from Glottis, cm); Midsagittal Distances (y: Midsag. Dist., cm; x: Distance from Glottis, cm); Vocal-Tract Profile (axes in cm).]

[Figure: results for isolated frames PB2447, PB2448, PB2449 and PB2450. Panels for each frame: Speech Signal (amplitude vs. time, ms); Power Spectrum (dB vs. frequency, kHz); Area Function (area, cm², vs. distance from glottis, cm); Midsagittal Distances (cm vs. distance from glottis, cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for isolated frames PB2451, PB2810, PB2811 and PB2812. Panels for each frame: Speech Signal (amplitude vs. time, ms); Power Spectrum (dB vs. frequency, kHz); Area Function (area, cm², vs. distance from glottis, cm); Midsagittal Distances (cm vs. distance from glottis, cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for isolated frames PB2813, PB2814, PB2823 and PB2824. Panels for each frame: Speech Signal (amplitude vs. time, ms); Power Spectrum (dB vs. frequency, kHz); Area Function (area, cm², vs. distance from glottis, cm); Midsagittal Distances (cm vs. distance from glottis, cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for isolated frames PB2825, PB2826, PB2827 and PB2828. Panels for each frame: Speech Signal (amplitude vs. time, ms); Power Spectrum (dB vs. frequency, kHz); Area Function (area, cm², vs. distance from glottis, cm); Midsagittal Distances (cm vs. distance from glottis, cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for isolated frames PB2834, PB2835, PB2836 and PB2837. Panels for each frame: Speech Signal (amplitude vs. time, ms); Power Spectrum (dB vs. frequency, kHz); Area Function (area, cm², vs. distance from glottis, cm); Midsagittal Distances (cm vs. distance from glottis, cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for isolated frames PB2838, PB2842, PB2843 and PB2844. Panels for each frame: Speech Signal (amplitude vs. time, ms); Power Spectrum (dB vs. frequency, kHz); Area Function (area, cm², vs. distance from glottis, cm); Midsagittal Distances (cm vs. distance from glottis, cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for isolated frames PB2845, PB2846, PB2847 and PB2851. Panels for each frame: Speech Signal (amplitude vs. time, ms); Power Spectrum (dB vs. frequency, kHz); Area Function (area, cm², vs. distance from glottis, cm); Midsagittal Distances (cm vs. distance from glottis, cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for isolated frames PB2852, PB2853, PB2854 and PB2855. Panels for each frame: Speech Signal (amplitude vs. time, ms); Power Spectrum (dB vs. frequency, kHz); Area Function (area, cm², vs. distance from glottis, cm); Midsagittal Distances (cm vs. distance from glottis, cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for isolated frames PB2856, PB2857, PB2858 and PB2859. Panels for each frame: Speech Signal (amplitude vs. time, ms); Power Spectrum (dB vs. frequency, kHz); Area Function (area, cm², vs. distance from glottis, cm); Midsagittal Distances (cm vs. distance from glottis, cm); Vocal-Tract Profile (cm vs. cm).]

[Figure: results for isolated frames PB2860 and PB2861. Panels for each frame: Speech Signal (amplitude vs. time, ms); Power Spectrum (dB vs. frequency, kHz); Area Function (area, cm², vs. distance from glottis, cm); Midsagittal Distances (cm vs. distance from glottis, cm); Vocal-Tract Profile (cm vs. cm).]

Bibliography
[ACT78] B. S. Atal, J. J. Chang, and J. W. Tukey. Inversion of articulatory-to-acoustic
transformation in the vocal-tract by a computer-sorting technique. The Journal of the
Acoustical Society of America, 63(5):1535–1555, 1978.
[AH71] B. S. Atal and S. L. Hanauer. Speech analysis and synthesis by linear
prediction of the speech wave. The Journal of the Acoustical Society of
America, 50:637–655, 1971.
[BBL95] D. Beautemps, P. Badin, and R. Laboissière. Deriving vocal-tract area
functions from midsagittal profiles and formant frequencies: a new model
for vowels and fricative consonants based on experimental data. Speech
Communication, 16:27–47, 1995.
[BGGN91] T. Baer, J. C. Gore, L. C. Gracco, and P. W. Nye. Analysis of vocal tract
shape and dimensions using magnetic resonance imaging. The Journal
of the Acoustical Society of America, 90(2):799–828, 1991.
[BLS91] G. Bailly, R. Laboissière, and J. L. Schwartz. Formant trajectories as
audible gestures: an alternative for speech synthesis. Journal of Phonetics,
19:9–23, 1991.
[BS95] A. Bell and T. Sejnowski. Blind separation and blind deconvolution: an
information-theoretic approach. In Proc. IEEE International Conference
on Acoustics, Speech and Signal Processing, 1995.
[BSWZ86] A. Bothorel, P. Simon, F. Wioland, and J. P. Zerling. Cinéradiographie de
voyelles et consonnes du Français. Institut de Phonétique de Strasbourg,
1986.

[CF66] C. Coker and O. Fujimura. Model for specification of the vocal tract area
function. The Journal of the Acoustical Society of America, 40:1271,
1966.
[CK41] T. Chiba and M. Kajiyama. The Vowel – Its Nature and Structure.
Tokyo, 1941.
[Cok76] C. Coker. A model of articulatory dynamics and control. Proc. IEEE,
64(4):452–460, 1976.
[Com94] P. Comon. Independent component analysis, a new concept? Signal
Processing, 36:287–314, 1994.
[Dav63] H. F. Davis. Fourier Series and Orthogonal Functions. Dover, 1963.
[Eis66] E. Eisner. Complete solutions of the 'Webster' horn equation. The Journal
of the Acoustical Society of America, 41(4):1126–1146, 1966.
[Fan67] G. Fant. On the predictability of formant levels and spectrum envelopes
from formant frequencies. In I. Lehiste, editor, Readings in Acoustic
Phonetics, pages 44–56. MIT, 1967. (Rep. from R. Jakobson, M. Halle,
and H. MacLean, eds., Mouton, 1956.)
[Fan70] G. Fant. Acoustic Theory of Speech Production. The Hague, 1970.
[Fan80] G. Fant. The relations between the area functions and the acoustic signal.
Phonetica, 37:55–86, 1980.
[FIS79] J. L. Flanagan, K. Ishizaka, and K. L. Shipley. Signal models for low
bit-rate coding of speech. The Journal of the Acoustical Society of America,
68(3):780–791, 1979.
[Fla55] J. L. Flanagan. A difference limen for vowel formant frequency. The
Journal of the Acoustical Society of America, 27:613–617, 1955.
[Fla72] J. L. Flanagan. Speech Analysis, Synthesis, and Perception.
Springer-Verlag, 1972.

[GS93] S. K. Gupta and J. Schroeter. Pitch synchronous frame-by-frame and
segment-based articulatory analysis by synthesis. The Journal of the
Acoustical Society of America, 94(5):2517–2530, 1993.
[HJ85] R. Horn and C. Johnson. Matrix Analysis. Cambridge, 1985.
[HS64] J. M. Heinz and K. N. Stevens. On the derivation of area functions and
acoustic spectra from cineradiographic films of speech. The Journal of
the Acoustical Society of America, 36:1037, 1964.
[IS73] F. Itakura and S. Saito. Analysis synthesis telephony based on the
maximum likelihood method. In J. Flanagan and L. Rabiner, editors, Speech
Synthesis, pages 289–292. Dowden, Hutchinson & Ross, 1973. (Rep.
from 6th Int. Cong. Acoust., Tokyo, 1968.)
[JN84] N. Jayant and P. Noll. Digital Coding of Waveforms. Springer-Verlag,
1984.
[Jor90] M. Jordan. Motor learning and the degrees of freedom problem. In
M. Jeannerod, editor, Attention and Performance, vol. XIII, pages 797–836.
Erlbaum, 1990.
[KL73] J. L. Kelly and C. C. Lochbaum. Speech synthesis. In J. Flanagan and
L. Rabiner, editors, Speech Synthesis, pages 127–130. Dowden, Hutchinson
& Ross, 1973. (Rep. from 4th Int. Cong. Acoust., Copenhagen,
1962.)
[Lin90a] Q. Lin. Speech production theory and articulatory speech synthesis. PhD
thesis, Royal Institute of Technology (KTH), Stockholm, 1990.
[Lin90b] B. Lindblom. Explaining phonetic variation: a sketch of the H & H theory.
In W. J. Hardcastle and A. Marchal, editors, Speech Production and
Speech Modelling, pages 403–439. Kluwer Academic Publishers, 1990.
[Mae72] S. Maeda. On the conversion of x-ray data into formant frequencies.
Technical report, Bell Laboratories, Murray Hill, N.J., 1972.
[Mae82] S. Maeda. A digital simulation method of the vocal-tract system. Speech
Communication, 1(3–4):199–229, 1982.

[Mae90] S. Maeda. Compensatory articulation during speech: evidence from the
analysis and synthesis of vocal-tract shapes using an articulatory model.
In W. J. Hardcastle and A. Marchal, editors, Speech Production and
Speech Modelling, pages 131–149. Kluwer Academic Publishers, 1990.
[McG94] R. S. McGowan. Recovering articulatory movement from formant
frequency trajectories using task dynamics and a genetic algorithm:
preliminary model tests. Speech Communication, 14:19–48, 1994.
[Mer67] P. Mermelstein. Determination of vocal-tract shape from measured
formant frequencies. The Journal of the Acoustical Society of America,
41(5):1283–1294, 1967.
[Mer73] P. Mermelstein. Articulatory model for the study of speech production.
The Journal of the Acoustical Society of America, 53(4):1070–1082, 1973.
[MG76] J. D. Markel and A. H. Gray. Linear Prediction of Speech.
Springer-Verlag, 1976.
[Pap91] A. Papoulis. Probability, Random Variables, and Stochastic Processes.
McGraw-Hill, 1991.
[PBS92] P. Perrier, L. J. Boë, and R. Sock. Vocal tract area function estimation
from midsagittal dimensions with CT scans and a vocal tract cast:
modelling the transition with two sets of coefficients. Journal of Speech and
Hearing Research, 35:53–67, 1992.
[PCS+92] J. S. Perkell, M. H. Cohen, M. A. Svirsky, M. L. Matthies, I. Garabieta,
and M. T. T. Jackson. Electromagnetic midsagittal articulometer systems
for transducing speech articulatory movements. The Journal of the
Acoustical Society of America, 92(6):3078–3096, 1992.
[Per69] J. S. Perkell. Physiology of speech production: Results and implications
of a quantitative cineradiographic study. Master's thesis, M.I.T., 1969.
[RJ93] L. Rabiner and B. W. Juang. Fundamentals of Speech Recognition.
Prentice Hall, 1993.

[RS78] L. Rabiner and R. Schafer. Digital Processing of Speech Signals. Prentice
Hall, 1978.
[Sal46] V. Salmon. Generalized plane wave horn theory. The Journal of the
Acoustical Society of America, 17(3):199–211, 1946.
[Sch67] M. R. Schroeder. Determination of the geometry of the human vocal-tract
by acoustical measurements. The Journal of the Acoustical Society
of America, 41(4):1002–1010, 1967.
[Scu90] C. Scully. Articulatory synthesis. In W. J. Hardcastle and A. Marchal,
editors, Speech Production and Speech Modelling, pages 151–186. Kluwer
Academic Publishers, 1990.
[Shi93] K. Shirai. Estimation and generation of articulatory motion using neural
networks. Speech Communication, 13:45–51, 1993.
[SK86] K. Shirai and T. Kobayashi. Estimating articulatory motion from the
speech wave. Speech Communication, 5:159–170, 1986.
[Son74] M. M. Sondhi. Model for wave propagation in a lossy vocal-tract. The
Journal of the Acoustical Society of America, 55(5):1070–1075, 1974.
[Son79] M. M. Sondhi. Estimation of vocal-tract areas: The need for acoustical
measurements. IEEE Transactions on Acoustics, Speech, and Signal
Processing, 27(3):268–273, 1979.
[SR83] M. M. Sondhi and J. R. Resnick. The inverse problem for the vocal-tract:
Numerical methods, acoustical experiments, and speech synthesis. The
Journal of the Acoustical Society of America, 73(3):985–1002, 1983.
[SS87] M. M. Sondhi and J. Schroeter. A hybrid time-frequency domain
articulatory speech synthesizer. IEEE Transactions on Acoustics, Speech, and
Signal Processing, 35(7):955–967, 1987.
[SS91] J. Schroeter and M. Sondhi. Speech coding based on physiological models
of speech production. In M. M. Sondhi and S. Furui, editors, Advances
in Speech Processing, pages 231–268. Marcel Dekker, 1991.

[SS94] J. Schroeter and M. M. Sondhi. Techniques for estimating vocal-tract
shapes from the speech signal. IEEE Transactions on Speech and Audio
Processing, 2(1):133–150, 1994.
[TY96] M. K. Tiede and H. Yehia. A shape-based approach to vocal tract area
function estimation. To appear in The Proceedings of the 1996 Joint
Meeting of the Acoustical Society of America and the Acoustical Society
of Japan, 1996.
[TYVB96] M. K. Tiede, H. Yehia, and E. Vatikiotis-Bateson. A shape-based
approach to vocal tract area function estimation. In Proceedings of the 1st
ESCA Tutorial and Research Workshop on Speech Production Modeling
& 4th Speech Production Seminar, pages 41–44, 1996.
[VK94] V. Välimäki and M. Karjalainen. Improving the Kelly-Lochbaum vocal
tract model using conical tube sections and fractional delay filtering
techniques. In Proc. International Conference on Spoken Language
Processing, pages S12–12.1–S12–12.4, 1994.
[Wak73] H. Wakita. Direct estimation of the vocal-tract shape by inverse filtering
of the acoustic speech waveforms. IEEE Transactions on Audio and
Electroacoustics, AU-21(5):417–427, 1973.
[Wak79] H. Wakita. Estimation of vocal-tract shapes from acoustical analysis of
the speech wave: the state of the art. IEEE Transactions on Acoustics,
Speech, and Signal Processing, ASSP-27(3):281–285, 1979.
[Web19] A. G. Webster. Acoustical impedance, and the theory of horns and of
the phonograph. Proc. Natl. Acad. Sci. (U.S.), 5:275–282, 1919.
[YHI95a] H. Yehia, M. Honda, and F. Itakura. Acoustic measurements of the
vocal-tract area function: sensitivity analysis and experiments. In Proc.
IEEE International Conference on Acoustics, Speech and Signal
Processing, pages 652–655, 1995.
[YHI95b] H. Yehia, M. Honda, and F. Itakura. Acoustical measurements of the
vocal-tract area function: System modelling and experimental results.
In Proceedings of the 1995 Spring Meeting of the Acoustical Society of
Japan, pages 305–306, 1995.
[YI93a] H. Yehia and F. Itakura. Dynamic vocal-tract shape determination
from formant frequencies using two-dimensional Fourier analysis. SP-92
143, Institute of Electronics, Information and Communication Engineers,
1993.
[YI93b] H. Yehia and F. Itakura. Variational and perturbation analysis applied to
determination of vocal-tract formants. In Proceedings of the 1993 Autumn
Meeting of the Acoustical Society of Japan, pages 285–286, 1993.
[YI94] H. Yehia and F. Itakura. Determination of human vocal-tract dynamic
geometry from formant trajectories using spatial and temporal Fourier
analysis. In Proc. IEEE International Conference on Acoustics, Speech
and Signal Processing, pages 477–480, 1994.
[YI95a] H. Yehia and F. Itakura. Analysis of a technique to measure the vocal-tract
cross-sectional area based on the impulse response at the lips. SP-94
107, Institute of Electronics, Information and Communication Engineers,
1995.
[YI95b] H. Yehia and F. Itakura. Combining dynamic and acoustic constraints in
the speech production inverse problem. SP-95 13, Institute of Electronics,
Information and Communication Engineers, 1995.
[YI96] H. Yehia and F. Itakura. A method to combine acoustical and
morphological constraints in the speech production inverse problem. Speech
Communication, 18(2):151–174, 1996.
[YT97] H. Yehia and M. Tiede. A parametric three-dimensional model of the
vocal-tract based on MRI data. To appear in Proc. IEEE International
Conference on Acoustics, Speech and Signal Processing, 1997.
[YTI95] H. Yehia, K. Takeda, and F. Itakura. A vocal-tract area function
trajectory representation oriented to the speech production inverse problem.
In Proceedings of the 1995 Autumn Meeting of the Acoustical Society of
Japan, pages 339–340, 1995.

[YTI96] H. Yehia, K. Takeda, and F. Itakura. An acoustically oriented vocal-tract
model. IEICE Transactions on Information and Systems,
E79-D(8):1198–1208, 1996.
[YTVBI96] H. Yehia, M. K. Tiede, E. Vatikiotis-Bateson, and F. Itakura. Applying
morphological constraints to estimate three-dimensional vocal-tract
shapes from partial profile and acoustic information. To appear in The
Proceedings of the 1996 Joint Meeting of the Acoustical Society of America
and the Acoustical Society of Japan, 1996.
List of Publications
Journal Papers
H. Yehia and F. Itakura, "A method to combine acoustical and morphological
constraints in the speech production inverse problem," Speech Communication,
18(2):151–174, 1996.
H. Yehia, K. Takeda, and F. Itakura, "An acoustically oriented vocal-tract model,"
IEICE Transactions on Information and Systems, E79-D(8):1198–1208, 1996.
H. Yehia, K. Takeda, and F. Itakura, "An analysis of the acoustic-to-articulatory
mapping during speech under morphological and continuity constraints,"
submitted to Speech Communication.

International Conferences
H. Yehia and F. Itakura, "Determination of human vocal-tract dynamic geometry
from formant trajectories using spatial and temporal Fourier analysis," In
Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing, pages 477–480, 1994.
H. Yehia, M. Honda, and F. Itakura, "Acoustic measurements of the vocal-tract area
function: sensitivity analysis and experiments," In Proc. IEEE International
Conference on Acoustics, Speech and Signal Processing, pages 652–655, 1995.
M. K. Tiede, H. Yehia, and E. Vatikiotis-Bateson, "A shape-based approach to vocal
tract area function estimation," In Proceedings of the 1st ESCA Tutorial and
Research Workshop on Speech Production Modeling & 4th Speech Production
Seminar, pages 41–44, 1996.

E. Vatikiotis-Bateson, K. G. Munhall, M. Hirayama, Y. Kasahara, and H. Yehia,
"Physiology-Based Synthesis of Audiovisual Speech," In Proceedings of the 1st
ESCA Tutorial and Research Workshop on Speech Production Modeling & 4th
Speech Production Seminar, pages 241–244, 1996.
E. Vatikiotis-Bateson, K. G. Munhall, Y. Kasahara, F. Garcia, and H. Yehia,
"Characterizing audiovisual information during speech," In Proceedings of the
International Conference on Spoken Language Processing, pages 1485–1488, 1996.
H. Yehia, M. K. Tiede, E. Vatikiotis-Bateson, and F. Itakura, "Applying
morphological constraints to estimate three-dimensional vocal-tract shapes from partial
profile and acoustic information," In The Proceedings of the 1996 Joint Meeting
of the Acoustical Society of America and the Acoustical Society of Japan, pages
855–860, 1996.
T. Taniguchi, H. Yehia, S. Kajita, T. Takeda, and F. Itakura, "On the problems of
applying Bell's blind separation to real environments," In The Proceedings of
the 1996 Joint Meeting of the Acoustical Society of America and the Acoustical
Society of Japan, pages 1257–1260, 1996.
M. K. Tiede and H. Yehia, "A shape-based approach to vocal tract area function
estimation," In The Proceedings of the 1996 Joint Meeting of the Acoustical
Society of America and the Acoustical Society of Japan, pages 861–866, 1996.
E. Vatikiotis-Bateson and H. Yehia, "Synthesizing audiovisual speech from
physiological signals," In The Proceedings of the 1996 Joint Meeting of the Acoustical
Society of America and the Acoustical Society of Japan, pages 811–816, 1996.
H. Yehia and M. Tiede, "A parametric three-dimensional model of the vocal-tract
based on MRI data," To appear in The Proceedings of ICASSP-97, 1997.

Technical Meetings and Symposia


H. Yehia and F. Itakura, "A method to estimate LPC parameters exploring frame
segmentation," In Proceedings of the 1992 Spring Meeting of the Acoustical
Society of Japan, pages 305–306, 1992.

H. Yehia and F. Itakura, "Dynamic vocal-tract shape determination from formant
frequencies using two-dimensional Fourier analysis," SP-92 143, Institute of
Electronics, Information and Communication Engineers, pages 49–56, 1993.
H. Yehia and F. Itakura, "Variational and perturbation analysis applied to
determination of vocal-tract formants," In Proceedings of the 1993 Autumn Meeting of
the Acoustical Society of Japan, pages 285–286, 1993.
H. Yehia and F. Itakura, "Analysis of a technique to measure the vocal-tract
cross-sectional area based on the impulse response at the lips," SP-94 107, Institute
of Electronics, Information and Communication Engineers, pages 69–76, 1995.
H. Yehia, M. Honda, and F. Itakura, "Acoustical measurements of the vocal-tract
area function: System modelling and experimental results," In Proceedings of
the 1995 Spring Meeting of the Acoustical Society of Japan, pages 305–306, 1995.
H. Yehia and F. Itakura, "Combining dynamic and acoustic constraints in the speech
production inverse problem," SP-95 13, Institute of Electronics, Information and
Communication Engineers, pages 23–30, 1995.
H. Yehia, K. Takeda, and F. Itakura, "A vocal-tract area function trajectory
representation oriented to the speech production inverse problem," In Proceedings
of the 1995 Autumn Meeting of the Acoustical Society of Japan, pages 339–340,
1995.
I. Masuda, H. Yehia, and H. Kawahara, "A study of a method for signal separation
by spectral interpolation using Bartlett window properties (in Japanese),"
EA-96 29, Institute of Electronics, Information and Communication Engineers,
pages 17–24, 1996.
E. Vatikiotis-Bateson and H. Yehia, "Physiological modeling of facial motion during
speech," H-96 65, The Acoustical Society of Japan, 1996.
H. Yehia, "Vocal-tract profile to area function mapping taking formant frequency
constraints into account," In Proceedings of the 1996 Autumn Meeting of the
Acoustical Society of Japan, pages 321–322, 1996.
