Speech production
Signal processing
Microphone variations
Spectrographic Examples
[Figure: anatomy of the speech production system. Labels: nasal cavity, hard palate, soft palate (velum), hyoid bone, epiglottis, cricoid cartilage, esophagus, tongue, jaw, thyroid cartilage, vocal cords (vocal folds), trachea, lung, sternum]
ARPAbet   Word
iy        beat
ih        bit
ey        bait
eh        bet
ae        bat
aa        bob
ao        bought
ah        but
ow        boat
uh        book
uw        boot
er        bird
ay        bite
oy        Boyd
aw        bout
ax        about
IPA     ARPAbet   Word
/s/     s         see
/ʃ/     sh        she
/f/     f         fee
/θ/     th        thief
/z/     z         z
/ʒ/     zh        Gigi
/v/     v         v
/ð/     dh        thee
/p/     p         pea
/t/     t         tea
/k/     k         key
/b/     b         bay
/d/     d         day
/g/     g         geese
IPA     ARPAbet   Word
/w/     w         wet
/r/     r         red
/l/     l         let
/y/     y         yet
/m/     m         meet
/n/     n         neat
/ŋ/     ng        sing
/tʃ/    ch        church
/dʒ/    jh        judge
/h/     hh        heat
[Figure: places of articulation in the vocal tract. Labels: labial, dental, alveolar, alveopalatal, palatal, velar, uvular]
A Speech Waveform
Spectral Representations
Speech waveforms are usually sampled at rates ranging from 8,000
(telephone) to 20,000 (wide-band) samples/sec
ASR systems typically transform the waveform into a spectrum: a
sequence of frequency-based analyses usually performed at
regular intervals (e.g., 10 ms)
A short-time Fourier transform (STFT) performs a spectral analysis
on waveform segments small enough to be able to assume that
the speech signal is quasi-stationary
The waveform segment is created by a moving window, whose
type (e.g., Hamming) and duration (e.g., 5-25 ms) have a
significant impact on the resulting spectrum
A spectrogram is an image computed from the resulting
spectrum, which is often used to examine the waveform
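The framing and windowing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the workshop's own code: the function name `spectrogram`, the 25 ms Hamming window, the 10 ms step, and the synthetic 1 kHz test tone are all assumptions picked to match the example values quoted on the slide.

```python
import numpy as np

def spectrogram(x, fs, win_ms=25.0, hop_ms=10.0):
    """Return log-magnitude short-time spectra (n_frames x n_bins) in dB."""
    win_len = int(round(fs * win_ms / 1000.0))   # samples per analysis window
    hop = int(round(fs * hop_ms / 1000.0))       # samples between analyses
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop     # assumes len(x) >= win_len
    # Cut the waveform into overlapping segments and taper each one.
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)       # one spectrum per segment
    return 20.0 * np.log10(np.abs(spectrum) + 1e-10)

fs = 8000                                        # telephone-rate sampling
t = np.arange(fs) / fs                           # one second of signal
x = np.sin(2 * np.pi * 1000.0 * t)               # hypothetical 1 kHz test tone
S = spectrogram(x, fs)

# Frequency (Hz) of the strongest bin, averaged over all frames.
freqs = np.fft.rfftfreq(200, d=1.0 / fs)         # 200 = samples in 25 ms at 8 kHz
peak_hz = freqs[np.argmax(S.mean(axis=0))]
```

Plotting `S.T` as an image, with time on the horizontal axis and frequency on the vertical axis, gives the spectrogram view. Shortening `win_ms` toward 5 ms trades frequency resolution for time resolution.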
CLSP Workshop 2000
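[Figure: the signal x[m] with sliding analysis windows w[100 - m] and w[200 - m]; analysis times n = 50, n = 100, n = 200]

$$X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} w[n-m]\, x[m]\, e^{-j\omega m}$$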
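The short-time Fourier transform sum, in which a time-reversed window w[n - m] is slid to analysis time n and multiplied against x[m], can be evaluated literally for a single n and frequency. The sketch below is illustrative only: the helper name `stft_at`, the random test signal, and the choice of a 100-point Hamming window are assumptions. It also checks the direct sum against an FFT of the same windowed segment, which differs only by a linear phase term.

```python
import numpy as np

def stft_at(x, w, n, omega):
    """Directly sum w[n - m] * x[m] * exp(-j * omega * m) over all m."""
    total = 0.0 + 0.0j
    for m in range(len(x)):
        k = n - m                        # index into the window
        if 0 <= k < len(w):              # the window is zero outside 0..len(w)-1
            total += w[k] * x[m] * np.exp(-1j * omega * m)
    return total

rng = np.random.default_rng(0)
x = rng.standard_normal(400)             # hypothetical test signal
w = np.hamming(100)
n, k = 200, 5                            # analysis time and DFT bin
omega = 2 * np.pi * k / len(w)

direct = stft_at(x, w, n, omega)

# Same quantity via an FFT: the window covers m = n-99 .. n, reversed in time.
start = n - len(w) + 1
segment = x[start : n + 1] * w[::-1]
via_fft = np.exp(-1j * omega * start) * np.fft.fft(segment)[k]
```

In practice only the FFT form is used. The direct loop is there to make the roles of n, m, and the window explicit.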
Comparison of Windows
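One way to compare window types numerically is to look at the peak sidelobe level of each window's own spectrum, since sidelobes control how much energy leaks between frequencies. The following is a rough sketch under stated assumptions: the helper `peak_sidelobe_db`, the 200-sample length, and the choice of rectangular versus Hamming windows are illustrative and are not taken from the slide's figure.

```python
import numpy as np

def peak_sidelobe_db(window, n_fft=8192):
    """Highest sidelobe of the window spectrum, in dB relative to the main-lobe peak."""
    mag = np.abs(np.fft.rfft(window, n_fft))         # zero-padded for a smooth curve
    mag_db = 20.0 * np.log10(mag / mag.max() + 1e-12)
    # Walk down the main lobe until the magnitude starts to rise again.
    i = 1
    while i < len(mag_db) - 1 and mag_db[i + 1] < mag_db[i]:
        i += 1
    return mag_db[i:].max()                          # everything past the first null

n = 200                                              # e.g. 25 ms at 8,000 samples/sec
rect_db = peak_sidelobe_db(np.ones(n))
hamming_db = peak_sidelobe_db(np.hamming(n))
```

The Hamming window's sidelobes come out far lower than the rectangular window's, at the cost of a main lobe roughly twice as wide. This is the trade-off behind the choice of window type.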
[Figure: vowel chart arranged FRONT to BACK and HIGH, MID, LOW. F1 increases from high vowels to low vowels; F2 increases from back vowels to front vowels]

SCHWAS:
  Plain (ə)       About     /əbaʊt/
  Front (ɨ)       Roses     /rozɨz/
  Retroflex (ɚ)   Forever   /fɚɛvɝ/

The Semi-vowels:
  y is like an extreme i
  w is like an extreme u
  l is like an extreme o
  r is like an extreme ɝ

Manner of Articulation by Place of Articulation:
              Labial   Dental   Alveolar   Palatal   Velar
  Stop        p b               t d                  k g
  Fricative   f v      θ ð      s z        ʃ ʒ
  Nasal       m                 n                    ŋ

  Weak (Non-strident) fricatives: f v θ ð
  Strong (Strident) fricatives: s z ʃ ʒ

The Odds and Ends:
  h (unvoiced h)
  ɦ (voiced h)
  ɾ (flap)
  ɾ̃ (nasalized flap)
  ʔ (glottal stop)

The Affricates:
  tʃ is like t + ʃ
  dʒ is like d + ʒ
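The place, manner, and voicing classification of the stops and fricatives lends itself to a small lookup table. The sketch below is a hypothetical encoding keyed by ARPAbet symbol. The names `Consonant`, `CONSONANTS`, `STRIDENT`, and `voicing_partner` are invented for illustration, and the feature values follow the stop and fricative pairs listed on the slide.

```python
from typing import NamedTuple, Optional

class Consonant(NamedTuple):
    place: str
    manner: str
    voiced: bool

# Stops and fricatives from the chart, keyed by ARPAbet symbol.
CONSONANTS = {
    "p": Consonant("labial", "stop", False),
    "b": Consonant("labial", "stop", True),
    "t": Consonant("alveolar", "stop", False),
    "d": Consonant("alveolar", "stop", True),
    "k": Consonant("velar", "stop", False),
    "g": Consonant("velar", "stop", True),
    "f": Consonant("labial", "fricative", False),
    "v": Consonant("labial", "fricative", True),
    "th": Consonant("dental", "fricative", False),
    "dh": Consonant("dental", "fricative", True),
    "s": Consonant("alveolar", "fricative", False),
    "z": Consonant("alveolar", "fricative", True),
    "sh": Consonant("palatal", "fricative", False),
    "zh": Consonant("palatal", "fricative", True),
}

# Strong (strident) fricatives; the remaining fricatives are weak.
STRIDENT = {"s", "z", "sh", "zh"}

def voicing_partner(symbol: str) -> Optional[str]:
    """Return the consonant with the same place and manner but opposite voicing."""
    c = CONSONANTS[symbol]
    for other, o in CONSONANTS.items():
        if (o.place, o.manner) == (c.place, c.manner) and o.voiced != c.voiced:
            return other
    return None

partner_of_s = voicing_partner("s")
```

A table like this makes the pairing on the chart explicit: each cell holds an unvoiced and voiced pair that share place and manner.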
Vowel Production
No significant constriction in the vocal tract
Usually produced with periodic excitation
Acoustic characteristics depend on the position of the jaw,
tongue, and lips
[Figure: panels for [i], [æ], [ɑ], [u]]

IPA     ARPAbet   Word
/i/     iy        beat
/ɪ/     ih        bit
/e/     ey        bait
/ɛ/     eh        bet
/æ/     ae        bat
/ɑ/     aa        Bob
/ɔ/     ao        bought
/ʌ/     ah        but
/o/     ow        boat
/ʊ/     uh        book
/u/     uw        boot
/ɝ/     er        Bert
/aɪ/    ay        bite
/ɔɪ/    oy        Boyd
/aʊ/    aw        bout
[ə]     ax        about
[ɨ]     ix        roses
[ɚ]     axr       butter
[Figure: average formant frequencies F1, F2, and F3 by vowel, in two panels, one titled Male Speakers. Vertical axis: Average Frequency (Hz), 500 to 3500. Horizontal axis: Vowel (ɪ e ɛ æ ɑ ɔ ʌ o ʊ u ɝ ə)]
[Figure: vowel space for Male Speakers, plotting F1 (300 to 900 Hz) against F2 (900 to 2700 Hz) in two panels; labelled vowels include ɝ, u, ʌ, and ɑ]
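Because each vowel occupies its own region of the F1/F2 plane, a measured formant pair can be mapped to the closest vowel. The sketch below is purely illustrative and rests on an outside assumption: the (F1, F2) values are the adult-male averages commonly quoted from Peterson and Barney (1952), not numbers read from the plots in these slides. The names `FORMANTS` and `nearest_vowel` are hypothetical.

```python
import math

# Assumed adult-male average (F1, F2) in Hz, as commonly quoted from
# Peterson and Barney (1952); keyed by ARPAbet symbol.
FORMANTS = {
    "iy": (270, 2290),
    "ih": (390, 1990),
    "eh": (530, 1840),
    "ae": (660, 1720),
    "aa": (730, 1090),
    "ao": (570, 840),
    "uh": (440, 1020),
    "uw": (300, 870),
    "ah": (640, 1190),
    "er": (490, 1350),
}

def nearest_vowel(f1: float, f2: float) -> str:
    """Return the vowel whose average (F1, F2) is closest in Euclidean distance."""
    return min(
        FORMANTS,
        key=lambda v: math.hypot(f1 - FORMANTS[v][0], f2 - FORMANTS[v][1]),
    )

guess = nearest_vowel(280.0, 2250.0)   # low F1, high F2: a high front vowel
```

The table also shows the two trends on the vowel chart: F1 rises from high vowels (iy, uw) to low vowels (ae, aa), and F2 rises from back vowels (uw, ao) to front vowels (iy, ih). A plain Euclidean distance in Hz is a simplification; real systems usually work on a perceptual frequency scale and normalize for the speaker.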
Vowel Durations
Each vowel has a different intrinsic duration
Schwas have distinctly shorter durations (50 ms)
/ɪ, ɛ, ʌ, ʊ/ are the shortest monophthongs
Context can greatly influence vowel duration
[Figure: vowel durations by vowel; panels: Female Speakers, Male Speakers; duration axis from 50 to 250; horizontal axis: Vowel]
Fricative Production
Turbulence produced at narrow constriction
Constriction position determines acoustic characteristics
Can be produced with periodic excitation
[Figure: panels for [f], [θ], [s], [ʃ]]

Unvoiced                 Voiced
/f/   f    fee           /v/   v    v
/θ/   th   thief         /ð/   dh   thee
/s/   s    see           /z/   z    z
/ʃ/   sh   she           /ʒ/   zh   Gigi
Fricative Energy
[Figure: distributions of fricative energy for NON-STRIDENT and STRIDENT fricatives. Horizontal axis: -100 to -40; vertical axis: 0.0 to 0.06]
Fricative Durations
[Figure: distributions of fricative duration for UNVOICED and VOICED fricatives. Horizontal axis: Duration, 0.0 to 0.30; vertical axis: 0 to 14]
Nasal Production
Velum lowering results in airflow through nasal cavity
Consonants produced with closure in oral cavity
Nasalized vowels have output through oral and nasal cavities
Nasal murmurs have similar spectral characteristics
[Figure: panels for [m], [n], [ŋ]]

Nasal
/m/   m    me
/n/   n    knee
/ŋ/   ng   sing
Nasal Durations
[Figure: nasal durations. Vertical axis: Duration (ms), 0 to 150. Categories: Singleton, Unvoiced Cluster, Voiced Cluster]
Semivowel Production
Constriction in vocal tract, no turbulence
Slower articulatory motion than other consonants
Laterals form complete closure with tongue tip,
airflow via sides of constriction
[Figure: panels for [w], [y], [r], [l]]

Semivowel   ARPAbet   Word   Nearest Vowel
/w/         w         wet    /u/
/y/         y         yet    /i/
/r/         r         red    /ɝ/
/l/         l         let    /o/
Aspirant Production
Stop Production
Complete closure in the vocal tract, pressure buildup
Sudden release of the constriction, turbulence noise
Can have periodic excitation during closure
[Figure: panels for [b], [d], [g]]

Voiced                   Unvoiced
/b/   b   bee            /p/   p   pea
/d/   d   Dee            /t/   t   tea
/g/   g   geese          /k/   k   key
/s/-Stop Durations
[Figure: /s/-stop durations; axis from 0 to 80]
Stop-Semivowel Durations
[Figure: stop durations for Singletons compared with [Stop][Semivowel] Clusters; axis from 0 to 100]
Affricate Production
Alveolar-stop palatal-fricative pairs
Sudden release of the constriction, turbulence noise
Can have periodic excitation during closure
Unvoiced
/tʃ/   ch   church
Acoustic Properties of Speech Sounds
[Figure: spectrographic display. Panels: Zero Crossing Rate, Total Energy (dB), Energy -- 125 Hz to 750 Hz (dB), spectrogram (frequency axis in kHz), Waveform. Time axis: 0.0 to 1.9 seconds]
File: /server/users/jwc/latex/sum97/sennheiser.wav
Printed by jwc on Wed Jul 16 11:58:32 1997
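The non-spectrogram panels in these displays (zero crossing rate, total energy, and energy in the 125 Hz to 750 Hz band) are all frame-level measurements that are straightforward to compute. The sketch below shows one plausible way to do so; it is not the tool that produced the displays. The 16 kHz sampling rate, 25 ms frames, 10 ms step, helper names, and the synthetic tone and noise signals are all assumptions.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice x into overlapping frames (n_frames x frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames, fs):
    """Sign changes per second within each frame."""
    signs = np.signbit(frames)
    crossings = np.sum(signs[:, 1:] != signs[:, :-1], axis=1)
    return crossings * fs / (frames.shape[1] - 1)

def energy_db(frames):
    """Total energy of each frame in dB."""
    return 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)

def band_energy_db(frames, fs, lo_hz=125.0, hi_hz=750.0):
    """Energy of each frame between lo_hz and hi_hz, in dB."""
    window = np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    return 10.0 * np.log10(np.sum(power[:, band], axis=1) + 1e-12)

fs = 16000                                        # assumed sampling rate
rng = np.random.default_rng(0)
t = np.arange(fs // 2) / fs
tone = np.sin(2 * np.pi * 300.0 * t)              # vowel-like low-frequency tone
noise = 0.1 * rng.standard_normal(fs // 2)        # fricative-like noise

tone_frames = frame_signal(tone, 400, 160)        # 25 ms frames, 10 ms step
noise_frames = frame_signal(noise, 400, 160)
tone_zcr = zero_crossing_rate(tone_frames, fs)
noise_zcr = zero_crossing_rate(noise_frames, fs)
```

On signals like these, the noise shows a much higher zero crossing rate than the tone, while the tone carries far more energy in the 125 Hz to 750 Hz band. This is the contrast that makes these panels useful for telling fricatives from voiced sounds.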
[Figure: spectrographic display. Panels: Zero Crossing Rate, Total Energy (dB), Energy -- 125 Hz to 750 Hz (dB), spectrogram (frequency axis in kHz), Waveform. Time axis: 0.0 to 1.9 seconds]
File: /server/users/jwc/latex/sum97/bk.wav
Printed by jwc on Wed Jul 16 11:57:43 1997
[Figure: spectrographic display. Panels: Zero Crossing Rate, Total Energy (dB), Energy -- 125 Hz to 750 Hz (dB), spectrogram (frequency axis in kHz), Waveform. Time axis: 0.0 to 1.9 seconds]
File: /server/users/jwc/latex/sum97/telephone.wav
Printed by jwc on Wed Jul 16 11:59:12 1997