Zue Lecture2

Acoustic Properties of Speech Sounds
Speech production
Signal processing
Properties of speech sounds of American English
Microphone variations
Spectrographic Examples
CLSP Workshop 2000
Anatomical Structures for Speech Production
[Figure: anatomical diagram of the speech production system. Labeled structures: nasal cavity, hard palate, soft palate (velum), tongue, jaw, hyoid bone, epiglottis, thyroid cartilage, cricoid cartilage, vocal cords (vocal folds), esophagus, trachea, lung, sternum]
Sub-Word Linguistic Units

The phoneme is one of the most basic linguistic units used to
represent pronunciations of words
ASR systems typically represent words as phoneme sequences
English contains approximately 40 phonemes which can be
grouped by manner and place of articulation
Manner Class   Number
Vowels         16
Fricatives     8
Stops          6
Semivowels     4
Nasals         3
Affricates     2
Aspirant       1
Phonemes in American English

IPA    ARPAbet  Word
/i/    iy       beat
/ɪ/    ih       bit
/e/    ey       bait
/ɛ/    eh       bet
/æ/    ae       bat
/ɑ/    aa       bob
/ɔ/    ao       bought
/ʌ/    ah       but
/o/    ow       boat
/ʊ/    uh       book
/u/    uw       boot
/ɝ/    er       bird
/aɪ/   ay       bite
/ɔɪ/   oy       Boyd
/aʊ/   aw       bout
/ə/    ax       about

IPA    ARPAbet  Word
/s/    s        see
/ʃ/    sh       she
/f/    f        fee
/θ/    th       thief
/z/    z        z
/ʒ/    zh       Gigi
/v/    v        v
/ð/    dh       thee
/p/    p        pea
/t/    t        tea
/k/    k        key
/b/    b        bay
/d/    d        day
/g/    g        geese

IPA    ARPAbet  Word
/w/    w        wet
/r/    r        red
/l/    l        let
/y/    y        yet
/m/    m        meet
/n/    n        neat
/ŋ/    ng       sing
/č/    ch       church
/ǰ/    jh       judge
/h/    hh       heat
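As a minimal sketch of how words can be stored as phoneme sequences, the snippet below maps a few example words to ARPAbet labels and looks up the manner class of each label. The names `LEXICON`, `MANNER`, `MANNER_COUNTS`, and `manner_sequence` are hypothetical, and the transcriptions are illustrative rather than taken from any particular recognizer.

```python
# Hypothetical mini-lexicon: each word is stored as a sequence of ARPAbet labels.
LEXICON = {
    "beat": ["b", "iy", "t"],
    "bit": ["b", "ih", "t"],
    "about": ["ax", "b", "aw", "t"],
    "sing": ["s", "ih", "ng"],
    "church": ["ch", "er", "ch"],
    "judge": ["jh", "ah", "jh"],
}

# Manner class of each label used in the lexicon above.
MANNER = {
    "iy": "vowel", "ih": "vowel", "ah": "vowel",
    "er": "vowel", "aw": "vowel", "ax": "vowel",
    "b": "stop", "t": "stop",
    "s": "fricative",
    "ng": "nasal",
    "ch": "affricate", "jh": "affricate",
}

# Phoneme counts per manner class, copied from the manner-class table.
MANNER_COUNTS = {
    "vowel": 16, "fricative": 8, "stop": 6, "semivowel": 4,
    "nasal": 3, "affricate": 2, "aspirant": 1,
}


def manner_sequence(word):
    """Return the manner class of every phoneme in the word's pronunciation."""
    return [MANNER[label] for label in LEXICON[word]]


if __name__ == "__main__":
    for word in LEXICON:
        print(word, LEXICON[word], manner_sequence(word))
    print("total phonemes:", sum(MANNER_COUNTS.values()))
```

The class counts add up to the "approximately 40" phonemes quoted earlier.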
Places of Articulation for Speech Production
[Figure: vocal tract diagram marking the places of articulation: labial, dental, alveolar, alveopalatal, palatal, velar, uvular]
A Speech Waveform
[Figure: waveform of the utterance "Two plus seven is less than ten"]
Spectral Representations
Speech waveforms are usually sampled at rates ranging from 8,000
(telephone) to 20,000 (wide-band) samples/sec
ASR systems typically transform the waveform into a spectrum: a
sequence of frequency-based analyses usually performed at
regular intervals (e.g., 10 ms)
A short-time Fourier transform (STFT) performs a spectral analysis
on waveform segments small enough to be able to assume that
the speech signal is quasi-stationary
The waveform segment is created by a moving window, whose
type (e.g., Hamming) and duration (e.g., 5-25 ms) have a
significant impact on the resulting spectrum
A spectrogram is an image computed from the resulting
spectrum, which is often used to examine the waveform
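The analysis chain described above (sample, window, transform at regular intervals) can be sketched with NumPy. This is an illustrative sketch only: the function name `stft_magnitude_db`, the 16 kHz sampling rate, the 25 ms Hamming window, and the 10 ms step are assumptions picked from within the ranges quoted on this slide, and the test signal is a synthetic tone rather than speech.

```python
import numpy as np


def stft_magnitude_db(x, fs, win_ms=25.0, hop_ms=10.0):
    """Log-magnitude short-time spectrum, one row per analysis frame."""
    win_len = int(round(fs * win_ms / 1000.0))   # window length in samples
    hop = int(round(fs * hop_ms / 1000.0))       # frame step in samples
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + win_len] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)       # one FFT per windowed frame
    return 20.0 * np.log10(np.abs(spectrum) + 1e-10)


if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs                       # one second of samples
    x = np.sin(2 * np.pi * 1000.0 * t)           # synthetic 1 kHz tone
    S = stft_magnitude_db(x, fs)
    bin_hz = fs / 400                            # 400-sample window
    print(S.shape, "peak near", S[0].argmax() * bin_hz, "Hz")
```

Plotting the returned array with time on one axis and frequency on the other gives a spectrogram image of the kind described above.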
Short-Time Fourier Transform

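[Figure: a signal x[m] with sliding windows w[50 - m], w[100 - m], and w[200 - m], positioned at n = 50, n = 100, and n = 200]

$$X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} w[n-m]\, x[m]\, e^{-j\omega m}$$

If n is fixed, then it can be shown that:

$$X_n(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, e^{j\theta n}\, X(e^{j(\omega+\theta)})\, d\theta$$

The above equation is meaningful only if we assume that $X(e^{j\omega})$
represents the Fourier transform of a signal whose properties continue
outside the window, or simply that the signal is zero outside the window.
In order for $X_n(e^{j\omega})$ to correspond to $X(e^{j\omega})$, $W(e^{j\omega})$ must resemble an
impulse with respect to $X(e^{j\omega})$.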
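The short-time Fourier transform sum can be evaluated directly for one time index n and one frequency. The sketch below does this literally, term by term; the function name `stft_at` and the convention that the window is stored for indices 0 to len(w) - 1 (and is zero elsewhere) are assumptions made for illustration.

```python
import numpy as np


def stft_at(x, w, n, omega):
    """Evaluate X_n(e^{j omega}) = sum_m w[n - m] x[m] e^{-j omega m}.

    The window w is assumed to be nonzero only for indices 0 .. len(w) - 1.
    """
    total = 0j
    for m in range(len(x)):
        k = n - m                      # index into the time-reversed window
        if 0 <= k < len(w):
            total += w[k] * x[m] * np.exp(-1j * omega * m)
    return total


if __name__ == "__main__":
    m = np.arange(64)
    x = np.cos(0.3 * m)                # sinusoid at 0.3 rad/sample
    w = np.hamming(16)
    # The magnitude is large near the sinusoid's frequency and small far away.
    print(abs(stft_at(x, w, 40, 0.3)), abs(stft_at(x, w, 40, 2.0)))
```

With an all-ones window that covers the whole signal, the same sum reduces to the ordinary discrete-time Fourier transform of x, which gives a convenient sanity check against an FFT.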
Comparison of Windows
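How closely a window's transform resembles an impulse can be judged from two numbers: the width of its main lobe and the level of its strongest side lobe. The following sketch measures both for a rectangular and a Hamming window; the function name `lobe_stats`, the 64-sample length, and the FFT size are arbitrary choices for illustration, not values from the lecture.

```python
import numpy as np


def lobe_stats(window, n_fft=8192):
    """Return (main-lobe half-width in FFT bins, peak side-lobe level in dB)."""
    spectrum = np.abs(np.fft.rfft(window, n_fft))
    level_db = 20.0 * np.log10(spectrum / spectrum.max() + 1e-12)
    # Walk down the main lobe until the first local minimum (its edge).
    i = 1
    while i < len(level_db) - 1 and level_db[i + 1] < level_db[i]:
        i += 1
    # Everything beyond the main-lobe edge is side lobe.
    return i, level_db[i:].max()


if __name__ == "__main__":
    length = 64
    for name, window in [("rectangular", np.ones(length)),
                         ("Hamming", np.hamming(length))]:
        width, sidelobe = lobe_stats(window)
        print(f"{name}: half-width {width} bins, peak side lobe {sidelobe:.1f} dB")
```

For equal durations, the Hamming window trades a wider main lobe (coarser frequency resolution) for much lower side lobes (less leakage) compared with the rectangular window.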
A Wide-Band Speech Spectrogram
[Figure: wide-band spectrogram of the utterance "Two plus seven is less than ten"]
A Narrow-Band Speech Spectrogram
[Figure: narrow-band spectrogram of the utterance "Two plus seven is less than ten"]
Spectral Averages: Corpus and Representation

TIMIT acoustic-phonetic corpus
phonetic transcription aligned with waveform
native speakers of American English (8 dialects)
8 sentences/speaker (dialect sentences excluded)
136 female, 326 male speakers (NIST train set)
3,696 utterances, 142,910 tokens
Mel-Frequency Spectral Coefficients (MFSCs)
Mel-frequency scale (linear < 1 kHz, log > 1 kHz)
40 channels (200 Hz - 6.4 kHz)
25 ms Hamming window, 5 ms frame-rate
Average computed over entire phonetic token (for stops, the
spectral slice at release was used)
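A mel-frequency spectral representation like the one listed above can be sketched as a bank of triangular filters applied to a short-time power spectrum. The sketch assumes a 16 kHz sampling rate, a 512-point FFT, triangular filter shapes, and the common mel formula 2595 log10(1 + f/700); none of these is specified on the slide, and the exact filter design used for the lecture's averages may differ. The function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    """Common mel-scale approximation (an assumption, not from the slide)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_filters=40, f_lo=200.0, f_hi=6400.0, n_fft=512, fs=16000):
    """Triangular filters with edges equally spaced on the mel scale."""
    edges_hz = mel_to_hz(
        np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    )
    bin_hz = np.arange(n_fft // 2 + 1) * fs / n_fft
    bank = np.zeros((n_filters, len(bin_hz)))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank, edges_hz


def mfsc(frame, bank, n_fft=512):
    """Log filterbank energies of one Hamming-windowed frame."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    return np.log(bank @ power + 1e-10)


if __name__ == "__main__":
    fs = 16000
    bank, edges = mel_filterbank(fs=fs)
    frame = np.sin(2 * np.pi * 1000.0 * np.arange(400) / fs)  # 25 ms frame
    print(mfsc(frame, bank).shape, edges[:3], edges[-3:])
```

Stepping the frame by 5 ms and averaging the resulting vectors over a phonetic token would give a per-token average of the kind described above.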
Rob's Happy Little Vowel Chart
"So inaccurate, yet so useful."
[Figure: vowel chart with axes FRONT to BACK (labeled "F2 Increases") and HIGH, MID, LOW (labeled "F1 Increases"), with vowel symbols placed by position. Annotation: "Think F3 is mighty low? Your pal ɝ is the way to go!"]
TENSE = Towards Edges; tends to be longer
LAX = Towards Center; tends to be shorter
SCHWAS:
Plain (ə): About /əbaʊt/
Front (ɨ): Roses /rozɨz/
Retroflex (ɚ): Forever /fɚɛvɝ/
Friendly Little Consonant Chart

"Somewhat more accurate, yet somewhat less useful."
Manner of Articulation (rows) by Place of Articulation (columns):
Manner      Labial   Dental   Alveolar   Palatal   Velar
Nasal       m                 n                    ŋ
Stop        p b               t d                  k g
Fricative   f v      θ ð      s z        ʃ ʒ
Voicing: within each pair, unvoiced is listed first, then voiced
Weak (Non-strident) fricatives: f v, θ ð
Strong (Strident) fricatives: s z, ʃ ʒ
The Semi-vowels:
y is like an extreme i
w is like an extreme u
l is like an extreme o
r is like an extreme ɝ
The Odds and Ends:
h (unvoiced h)
ɦ (voiced h)
ɾ (flap)
ɾ̃ (nasalized flap)
ʔ (glottal stop)
The Affricates:
č is like t+ʃ
ǰ is like d+ʒ
Vowel Production
No significant constriction in the vocal tract
Usually produced with periodic excitation
Acoustic characteristics depend on the position of the jaw,
tongue, and lips
[Figure: illustrations of vowel production for [i], [æ], [ɑ], and [u]]
Vowels of American English
There are approximately 18 vowels in American English, made up
of monophthongs, diphthongs, and reduced vowels (schwas)
They are often described by the articulatory features: High/Low,
Front/Back, Retroflexed, Rounded, and Tense/Lax
IPA    ARPAbet  Word
/i/    iy       beat
/ɪ/    ih       bit
/e/    ey       bait
/ɛ/    eh       bet
/æ/    ae       bat
/ɑ/    aa       Bob
/ɔ/    ao       bought
/ʌ/    ah       but
/o/    ow       boat
/ʊ/    uh       book
/u/    uw       boot
/ɝ/    er       Bert
/aɪ/   ay       bite
/ɔɪ/   oy       Boyd
/aʊ/   aw       bout
[ə]    ax       about
[ɨ]    ix       roses
[ɚ]    axr      butter
Vowel Formant Averages

Vowels are often characterized by F1, F2, and F3
High/Low is correlated with F1
Front/Back is correlated with F2
Retroflexion is marked by a low F3
[Figure: average F1, F2, and F3 frequency (Hz, axis from 500 to 3500) per vowel, in separate panels for female and male speakers. Vowels shown: ɪ e ɛ æ ɑ ɔ ʌ o ʊ u ɝ ə]
Vowel Formant Trajectories

Diphthongs can have significant formant motion
Most vowels in American English are somewhat diphthongized
[Figure: vowel formant trajectories plotted in the F1-F2 plane (F1 ticks from 300 to 900, F2 ticks from 900 to 2700), in separate panels for female and male speakers]
Vowel Durations
Each vowel has a different intrinsic duration
Schwas have distinctly shorter durations (50 ms)
/ɪ, ɛ, ʌ, ʊ/ are the shortest monophthongs
Context can greatly influence vowel duration
[Figure: average duration (ms, axis from 50 to 250) per vowel, in separate panels for female and male speakers]
Fricative Production
Turbulence produced at narrow constriction
Constriction position determines acoustic characteristics
Can be produced with periodic excitation
[Figure: illustrations of fricative production for [f], [θ], [s], and [ʃ]]
Fricatives of American English
There are 8 fricatives in American English

They are often described by the features Strident/Non-Strident
(Strong/Weak), Voiced/Unvoiced
Four places of articulation: Labial, Dental, Alveolar, and Palatal
Type       Unvoiced          Voiced
Labial     /f/ f fee         /v/ v v
Dental     /θ/ th thief      /ð/ dh thee
Alveolar   /s/ s see         /z/ z z
Palatal    /ʃ/ sh she        /ʒ/ zh Gigi
Fricative Energy
[Figure: probability density (unadjusted for frequency) of average total energy, axis from -100 to -40, for STRIDENT and NON-STRIDENT fricatives]
Strident fricatives tend to be stronger than non-strident
Fricative Durations
[Figure: probability density (unadjusted for frequency) of duration, axis from 0.0 to 0.30, for UNVOICED and VOICED fricatives]
Voiced fricatives tend to be shorter than unvoiced
Nasal Production
Velum lowering results in airflow through nasal cavity
Consonants produced with closure in oral cavity
Nasalized vowels have output through oral and nasal cavities
Nasal murmurs have similar spectral characteristics
[Figure: illustrations of nasal production for [m], [n], and [ŋ]]
Nasal Consonants of American English

Three places of articulation: Labial, Alveolar, and Velar
Always attached to a vowel, though can form an entire syllable in
unstressed environments ([n̩], [m̩], [ŋ̩])
/ŋ/ is always post-vocalic
Place identified by neighboring formant transitions
Type       Nasal
Labial     /m/ m me
Alveolar   /n/ n knee
Velar      /ŋ/ ng sing
Nasal Durations
[Figure: nasal duration (ms, axis from 0 to 150) for singletons, unvoiced clusters, and voiced clusters]
Nasal consonants tend to be shorter in clusters with unvoiced
consonants, and longer with voiced consonants
Semivowel Production
Constriction in vocal tract, no turbulence
Slower articulatory motion than other consonants
Laterals form complete closure with tongue tip,
airflow via sides of constriction
[Figure: illustrations of semivowel production for [w], [y], [r], and [l]]
Semivowels of American English

There are 4 semivowels in American English
Always attached to a vowel, though /l/ can form an entire syllable
in unstressed environments ([l̩])
Extreme articulation of a corresponding vowel
Similar formant positions
Generally weaker due to constriction
Type      Semivowel     Nearest Vowel
Glide     /w/ w wet     /u/
Glide     /y/ y yet     /i/
Liquid    /r/ r red     /ɝ/
Liquid    /l/ l let     /o/
Acoustic Properties of Semivowels

/w/ is characterized by a very low F1 and F2
Typically a rapid spectral falloff above F2
/y/ is characterized by very low F1, very high F2
/r/ is characterized by a very low F3
Prevocalic F3 < medial F3 < postvocalic F3
/l/ is characterized by a low F1 and F2
Often presence of high frequency energy
Postvocalic /l/ characterized by minimal spectral discontinuity,
gradual motion of formants
Aspirant Production
/h/ in American English
Turbulence excitation at glottis
No constriction in the vocal tract, normal formant excitation
Coupling with subglottal system results in little energy in F1 region
Periodic excitation can be present in medial position
Stop Production
Complete closure in the vocal tract, pressure buildup
Sudden release of the constriction, turbulence noise
Can have periodic excitation during closure
[Figure: illustrations of stop production for [b], [d], and [g]]
Stops of American English

There are 6 stop consonants in American English
Same places of articulation as nasal consonants
Unvoiced stops are typically aspirated
Voiced stops usually exhibit a voice-bar during closure
Information about formant transitions and release useful for
classification
Type       Voiced          Unvoiced
Labial     /b/ b bee       /p/ p pea
Alveolar   /d/ d Dee       /t/ t tea
Velar      /g/ g geese     /k/ k key
Singleton Stop Durations
[Figure: VOT duration (ms, axis from 0 to 80) for singleton stops]
The voice onset time (VOT) of unvoiced stops
is longer than that of voiced stops
/s/-Stop Durations
[Figure: VOT duration (ms, axis from 0 to 80) for stops in /s/-stop sequences]
Unvoiced stops are unaspirated in /s/-stop sequences
Stop-Semivowel Durations
[Figure: VOT duration (ms, axis from 0 to 100) for singleton stops and for [Stop][Semivowel] clusters]
Semivowels are partially devoiced in stop-semivowel sequences
Voicing Cues for Stops
There are many voicing cues for a stop

Affricate Production
Alveolar-stop palatal-fricative pairs
Sudden release of the constriction, turbulence noise
Can have periodic excitation during closure
Affricates of American English

There are two affricates in American English
Voicing    Affricate
Voiced     /ǰ/ jh judge
Unvoiced   /č/ ch church
Speech from a Close-Talking Microphone

16
0.0
0.1
0.2
Zero Crossing Rate
0.3
0.4
0.5
0.6
0.7
0.8
Time (seconds)
0.9
1.0
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
kHz 8
16
8 kHz
0
Total Energy
dB
dB
Energy -- 125 Hz to 750 Hz
dB
8
dB
8
Wide Band Spectrogram
4 kHz
kHz 4
0
Waveform
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
File: /server/users/jwc/latex/sum97/sennheiser.wav
Printed by jwc on Wed Jul 16 11:58:32 1997
Page: 1
The Thinker is a famous sculpture

CLSP Workshop 2000
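The per-frame measurements named in these displays (zero crossing rate, total energy, and energy between 125 Hz and 750 Hz) can be sketched as follows. The frame length, step size, function names, and the FFT-based band energy are assumptions for illustration; the tool that produced the original displays may have computed them differently. Synthetic tones stand in for the recordings, which are not available here.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Slice a signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])


def zero_crossing_rate(frames, fs):
    """Sign changes per second within each frame."""
    negative = np.signbit(frames)
    crossings = np.sum(negative[:, 1:] != negative[:, :-1], axis=1)
    return crossings * fs / frames.shape[1]


def energy_db(frames):
    """Mean-square energy of each frame in dB."""
    return 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)


def band_energy_db(frames, fs, f_lo=125.0, f_hi=750.0):
    """Energy of each Hamming-windowed frame between f_lo and f_hi, in dB."""
    windowed = frames * np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(windowed, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    in_band = (freqs >= f_lo) & (freqs <= f_hi)
    return 10.0 * np.log10(power[:, in_band].sum(axis=1) + 1e-12)


if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    low = np.sin(2 * np.pi * 200.0 * t + 0.5)     # stand-in for voiced energy
    high = np.sin(2 * np.pi * 4000.0 * t + 0.5)   # stand-in for frication
    for name, x in [("200 Hz", low), ("4000 Hz", high)]:
        frames = frame_signal(x, 400, 160)        # 25 ms frames, 10 ms step
        print(name,
              zero_crossing_rate(frames, fs).mean(),
              energy_db(frames).mean(),
              band_energy_db(frames, fs).mean())
```

High-frequency, noise-like segments such as strident fricatives tend to show a high zero crossing rate and little low-band energy, while voiced segments show the opposite pattern.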
Speech from an Omni-Directional Microphone
[Figure: analysis display over 0.0 to 1.9 seconds, with panels for zero crossing rate, total energy (dB), a 0 to 8 kHz frequency panel, and waveform. File: /server/users/jwc/latex/sum97/bk.wav]
Speech over a Telephone Channel

16
0.0
0.1
0.2
Zero Crossing Rate
0.3
0.4
0.5
0.6
0.7
0.8
Time (seconds)
0.9
1.0
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
kHz 8
16
8 kHz
0
Total Energy
dB
dB
dB
8
dB
8
4 kHz
kHz 4
0
Waveform
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
File: /server/users/jwc/latex/sum97/telephone.wav
Page: 1

CLSP Workshop 2000
