
Acoustic Properties of Speech Sounds

Speech production

Signal processing

Properties of speech sounds of American English

Microphone variations

Spectrographic Examples

CLSP Workshop 2000


Anatomical Structures for Speech Production

[Figure: labeled anatomy of the speech production system. Labels: Nasal Cavity, Hard Palate, Soft Palate (Velum), Tongue, Jaw, Hyoid Bone, Epiglottis, Thyroid Cartilage, Cricoid Cartilage, Vocal Folds (Vocal Cords), Esophagus, Trachea, Lung, Sternum]

Sub-Word Linguistic Units


The phoneme is one of the most basic linguistic units used to
represent pronunciations of words
ASR systems typically represent words as phoneme sequences
English contains approximately 40 phonemes which can be
grouped by manner and place of articulation
Manner Class   Number
Vowels         16
Fricatives     8
Stops          6
Semivowels     4
Nasals         3
Affricates     2
Aspirant       1

Phonemes in American English


IPA     ARPAbet  Word
/i/     iy       beat
/ɪ/     ih       bit
/e/     ey       bait
/ɛ/     eh       bet
/æ/     ae       bat
/ɑ/     aa       bob
/ɔ/     ao       bought
/ʌ/     ah       but
/o/     ow       boat
/ʊ/     uh       book
/u/     uw       boot
/ɝ/     er       bird
/aɪ/    ay       bite
/ɔɪ/    oy       Boyd
/aʊ/    aw       bout
/ə/     ax       about

IPA     ARPAbet  Word
/s/     s        see
/ʃ/     sh       she
/f/     f        fee
/θ/     th       thief
/z/     z        z
/ʒ/     zh       Gigi
/v/     v        v
/ð/     dh       thee
/p/     p        pea
/t/     t        tea
/k/     k        key
/b/     b        bay
/d/     d        day
/g/     g        geese

IPA     ARPAbet  Word
/w/     w        wet
/r/     r        red
/l/     l        let
/y/     y        yet
/m/     m        meet
/n/     n        neat
/ŋ/     ng       sing
/č/     ch       church
/ǰ/     jh       judge
/h/     hh       heat
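As a rough illustration of how an ASR system might store words as phoneme sequences, the sketch below groups the ASCII phoneme symbols listed above (iy, ih, ..., hh) by manner class and looks up a few words. The names `MANNER_CLASSES`, `LEXICON`, and `manner_sequence` are hypothetical, and the pronunciations are illustrative entries written for this example, not taken from any particular dictionary.

```python
# ASCII phoneme symbols grouped by manner class, following the tables above.
MANNER_CLASSES = {
    "vowel": ["iy", "ih", "ey", "eh", "ae", "aa", "ao", "ah",
              "ow", "uh", "uw", "er", "ay", "oy", "aw", "ax"],
    "fricative": ["s", "sh", "f", "th", "z", "zh", "v", "dh"],
    "stop": ["p", "t", "k", "b", "d", "g"],
    "semivowel": ["w", "r", "l", "y"],
    "nasal": ["m", "n", "ng"],
    "affricate": ["ch", "jh"],
    "aspirant": ["hh"],
}

# Reverse lookup: phoneme symbol -> manner class.
PHONE_TO_MANNER = {
    phone: manner
    for manner, phones in MANNER_CLASSES.items()
    for phone in phones
}

# Hypothetical mini-lexicon: each word is a sequence of phoneme symbols.
LEXICON = {
    "beat": ["b", "iy", "t"],
    "sing": ["s", "ih", "ng"],
    "church": ["ch", "er", "ch"],
    "judge": ["jh", "ah", "jh"],
    "heat": ["hh", "iy", "t"],
}


def manner_sequence(word):
    """Return the manner class of each phoneme in a lexicon entry."""
    return [PHONE_TO_MANNER[phone] for phone in LEXICON[word]]


class_sizes = {manner: len(phones) for manner, phones in MANNER_CLASSES.items()}
total_phonemes = sum(class_sizes.values())
```

The class sizes computed here can be compared against the manner-class counts given earlier (16 vowels, 8 fricatives, 6 stops, 4 semivowels, 3 nasals, 2 affricates, 1 aspirant).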


Places of Articulation for Speech Production

[Figure: places of articulation in the vocal tract. Labels: Labial, Dental, Alveolar, Alveopalatal, Palatal, Velar, Uvular]

A Speech Waveform

Two plus seven is less than ten



Spectral Representations
Speech waveforms are usually sampled at rates ranging from 8,000
(telephone) to 20,000 (wide-band) samples/sec
ASR systems typically transform the waveform into a spectrum: a
sequence of frequency-based analyses usually performed at
regular intervals (e.g., every 10 ms)
A short-time Fourier transform (STFT) performs a spectral analysis
on waveform segments short enough that the speech signal can be
assumed to be quasi-stationary
The waveform segment is created by a moving window, whose
type (e.g., Hamming) and duration (e.g., 5-25 ms) have a
significant impact on the resulting spectrum
A spectrogram is an image computed from the resulting sequence
of spectra, and is often used to examine the speech signal
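The framing procedure described above can be sketched with NumPy as follows. This is a minimal sketch, not the workshop's own analysis code: the function name `spectrogram_db`, the 512-point FFT, and the synthetic 1 kHz test tone are assumptions made for illustration.

```python
import numpy as np


def spectrogram_db(x, fs, win_ms=25.0, hop_ms=10.0, n_fft=512):
    """Log-power spectrogram: one Hamming-windowed FFT per frame.

    Assumes n_fft is at least as long as the window (in samples).
    Returns an array of shape (n_frames, n_fft // 2 + 1) in dB.
    """
    win_len = int(round(fs * win_ms / 1000.0))   # window duration in samples
    hop = int(round(fs * hop_ms / 1000.0))       # analysis interval in samples
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([
        x[i * hop: i * hop + win_len] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return 10.0 * np.log10(power + 1e-12)        # small floor avoids log(0)


# Synthetic stand-in for speech: one second of a 1 kHz tone at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000.0 * t)

narrow_band = spectrogram_db(tone, fs, win_ms=25.0)  # long window
wide_band = spectrogram_db(tone, fs, win_ms=5.0)     # short window
peak_bin = int(np.argmax(narrow_band.mean(axis=0)))
```

A longer window gives finer frequency resolution (narrow-band analysis), while a shorter window gives finer time resolution (wide-band analysis); only `win_ms` changes between the two calls.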

Short-Time Fourier Transform


[Figure: the window w[n - m] shown at positions n = 50, n = 100, and n = 200 along the signal x[m]]

$$X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} w[n-m]\, x[m]\, e^{-j\omega m}$$

If n is fixed, then it can be shown that:

$$X_n(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, e^{j\theta n}\, X(e^{j(\omega+\theta)})\, d\theta$$

The above equation is meaningful only if we assume that $X(e^{j\omega})$
represents the Fourier transform of a signal whose properties continue
outside the window, or simply that the signal is zero outside the window.
In order for $X_n(e^{j\omega})$ to correspond to $X(e^{j\omega})$, $W(e^{j\omega})$ must resemble an
impulse with respect to $X(e^{j\omega})$.
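To make the definition concrete, the sketch below evaluates the short-time Fourier transform sum directly at a fixed time index n for a finite-length window. The function name `stft_at`, the 64-point Hamming window, and the cosine test signal are assumptions chosen for illustration; a practical system would use an FFT rather than this direct sum.

```python
import numpy as np


def stft_at(x, w, n, omegas):
    """Evaluate X_n(e^{jw}) = sum_m w[n - m] x[m] e^{-j w m} at time index n.

    The window w is taken to be zero outside indices 0 .. len(w) - 1.
    """
    x = np.asarray(x, dtype=float)
    m = np.arange(len(x))
    k = n - m                                  # index into the window
    inside = (k >= 0) & (k < len(w))           # where the window is non-zero
    segment = np.zeros(len(x))
    segment[inside] = w[k[inside]] * x[inside]
    return np.array([np.sum(segment * np.exp(-1j * om * m)) for om in omegas])


# Test signal: a stationary cosine whose frequency falls on bin 32 of 256.
n_bins = 256
omegas = 2 * np.pi * np.arange(n_bins // 2 + 1) / n_bins
omega0 = 2 * np.pi * 32 / n_bins
x = np.cos(omega0 * np.arange(400))
w = np.hamming(64)

X100 = stft_at(x, w, 100, omegas)   # window positioned at n = 100
X200 = stft_at(x, w, 200, omegas)   # window positioned at n = 200
peak100 = int(np.argmax(np.abs(X100)))
peak200 = int(np.argmax(np.abs(X200)))
```

Because the test signal is stationary, the magnitude spectrum is nearly the same at both window positions, and the peak is smeared over several bins by the window's main lobe rather than appearing as a single line.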

Comparison of Windows
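The original comparison figures did not survive extraction. As a stand-in, the following sketch compares a rectangular and a Hamming window of the same length by main-lobe width and peak side-lobe level. The helper name `main_lobe_and_side_lobe`, the 64-sample length, and the 8192-point FFT are assumptions for illustration and are not taken from the slides.

```python
import numpy as np


def main_lobe_and_side_lobe(w, n_fft=8192):
    """Return (main-lobe half-width in rad/sample, peak side-lobe in dB)."""
    mag = np.abs(np.fft.rfft(w, n=n_fft))
    resp_db = 20.0 * np.log10(mag / mag.max() + 1e-12)
    # The response falls monotonically across the main lobe, so the first
    # bin where it starts rising again marks the edge of the main lobe.
    edge = int(np.where(np.diff(resp_db) > 0)[0][0])
    half_width = 2 * np.pi * edge / n_fft
    peak_side_lobe_db = float(resp_db[edge:].max())
    return half_width, peak_side_lobe_db


length = 64
rect_width, rect_side = main_lobe_and_side_lobe(np.ones(length))
hamm_width, hamm_side = main_lobe_and_side_lobe(np.hamming(length))
```

The Hamming window trades a wider main lobe (coarser frequency resolution) for much lower side lobes (less leakage) than the rectangular window of the same duration.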


Comparison of Windows (cont'd)


A Wide-Band Speech Spectrogram

Two plus seven is less than ten



A Narrow-Band Speech Spectrogram

Two plus seven is less than ten



Spectral Averages: Corpus and Representation


TIMIT acoustic-phonetic corpus
phonetic transcription aligned with waveform
native speakers of American English (8 dialects)
8 sentences/speaker (dialect sentences excluded)
136 female, 326 male speakers (NIST train set)
3,696 utterances, 142,910 tokens
Mel-Frequency Spectral Coefficients (MFSCs)
Mel-frequency scale (linear < 1 kHz, log > 1 kHz)
40 channels (200 Hz - 6.4 kHz)
25 ms Hamming window, 5 ms frame period
Average computed over entire phonetic token (for stops, the
spectral slice at release was used)
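The slide does not say which mel formula or filter shapes were used, so the sketch below is only one plausible reading. It assumes the widely used mapping mel = 2595 log10(1 + f/700), which is roughly linear below 1 kHz and logarithmic above, and spaces 40 channel centers evenly on that scale between 200 Hz and 6.4 kHz. The function names are hypothetical. The last lines check the corpus arithmetic stated above.

```python
import numpy as np


def hz_to_mel(f_hz):
    """Common mel mapping (an assumption; the slides do not give a formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_center_frequencies(n_channels=40, f_lo=200.0, f_hi=6400.0):
    """Channel centers equally spaced on the mel scale between f_lo and f_hi."""
    edges = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_channels + 2)
    return mel_to_hz(edges[1:-1])       # drop the two outer band edges


centers = mel_center_frequencies()

# Corpus arithmetic from the list above.
speakers = 136 + 326                    # female + male
utterances = speakers * 8               # 8 sentences per speaker
frames_per_second = 1000 // 5           # one frame every 5 ms
```

With these assumptions the channel spacing in Hz widens steadily with frequency, and the speaker and sentence counts reproduce the 3,696 utterances quoted for the training set.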

Rob's Happy Little Vowel Chart

"So inaccurate, yet so useful."

[Figure: vowel chart with axes labeled FRONT/BACK ("F2 Increases") and HIGH/MID/LOW ("F1 Increases"); the hand-lettered annotations inside the chart are not recoverable from the extraction]

TENSE = Towards Edges (tends to be longer)
LAX = Towards Center (tends to be shorter)

SCHWAS:
Plain (ə): About /əbaʊt/
Front (ɨ): Roses /rozɨz/
Retroflex (ɚ): Forever /fɚɛvɝ/

Friendly Little Consonant Chart


"Somewhat more accurate, yet somewhat less useful."

Place of Articulation (columns) by Manner of Articulation (rows):

Manner                            Labial  Dental  Alveolar  Palatal  Velar
Stop                              p b             t d                k g
Fricative, Weak (Non-strident)    f v     θ ð
Fricative, Strong (Strident)                      s z       ʃ ʒ
Nasal                             m               n                  ŋ

Voicing: within each pair, Unvoiced then Voiced

The Semi-vowels:
y is like an extreme i
w is like an extreme u
l is like an extreme o
r is like an extreme ɝ

The Odds and Ends:
h (unvoiced h)
ɦ (voiced h)
ɾ (flap)
ɾ̃ (nasalized flap)
ʔ (glottal stop)

The Affricates:
č is like t+ʃ
ǰ is like d+ʒ

Vowel Production
No significant constriction in the vocal tract
Usually produced with periodic excitation
Acoustic characteristics depend on the position of the jaw,
tongue, and lips

[Figure: panels labeled [i], [æ], [ɑ], [u]]

Vowels of American English

There are approximately 18 vowels in American English made up
of monophthongs, diphthongs, and reduced vowels (schwas)
They are often described by the articulatory features: High/Low,
Front/Back, Retroflexed, Rounded, and Tense/Lax

IPA     ARPAbet  Word
/i/     iy       beat
/ɪ/     ih       bit
/e/     ey       bait
/ɛ/     eh       bet
/æ/     ae       bat
/ɑ/     aa       Bob
/ɔ/     ao       bought
/ʌ/     ah       but
/o/     ow       boat
/ʊ/     uh       book
/u/     uw       boot
/ɝ/     er       Bert
/aɪ/    ay       bite
/ɔɪ/    oy       Boyd
/aʊ/    aw       bout
[ə]     ax       about
[ɨ]     ix       roses
[ɚ]     axr      butter

Vowel Formant Averages


Vowels are often characterized by F1, F2, and F3
High/Low is correlated with F1
Front/Back is correlated with F2
Retroflexion is marked by a low F3

[Figure: average frequency (Hz) of F1, F2, and F3 for each vowel, in separate panels for female and male speakers]

Vowel Formant Trajectories


Diphthongs can have significant formant motion
Most vowels in American English are somewhat diphthongized

[Figure: formant trajectories plotted as F2 (Hz) against F1 (Hz), in separate panels for female and male speakers]

Vowel Durations
Each vowel has a different intrinsic duration
Schwas have distinctly shorter durations (about 50 ms)
/ɪ, ɛ, ʌ, ʊ/ are the shortest monophthongs
Context can greatly influence vowel duration

[Figure: average duration (ms) of each vowel, in separate panels for female and male speakers]

Fricative Production
Turbulence produced at narrow constriction
Constriction position determines acoustic characteristics
Can be produced with periodic excitation

[Figure: panels labeled [f], [θ], [s], [ʃ]]

Fricatives of American English

There are 8 fricatives in American English


They are often described by the features Strident/Non-Strident
(Strong/Weak), Voiced/Unvoiced
Four places of articulation: Labial, Dental, Alveolar, and Palatal
Type      Unvoiced           Voiced
Labial    /f/  f   fee       /v/  v   v
Dental    /θ/  th  thief     /ð/  dh  thee
Alveolar  /s/  s   see       /z/  z   z
Palatal   /ʃ/  sh  she       /ʒ/  zh  Gigi

Fricative Energy

[Figure: probability density (unadjusted for frequency) of average total energy for strident and non-strident fricatives; horizontal axis from -100 to -40]

Strident fricatives tend to be stronger than non-strident fricatives

Fricative Durations

[Figure: probability density (unadjusted for frequency) of duration for unvoiced and voiced fricatives; horizontal axis from 0.0 to 0.30]

Voiced fricatives tend to be shorter than unvoiced fricatives

Nasal Production
Velum lowering results in airflow through nasal cavity
Consonants produced with closure in oral cavity
Nasalized vowels have output through oral and nasal cavities
Nasal murmurs have similar spectral characteristics

[Figure: panels labeled [m], [n], [ŋ]]

Nasal Consonants of American English


Three places of articulation: Labial, Alveolar, and Velar
Always attached to a vowel, though can form an entire syllable in
unstressed environments ([n̩], [m̩], [ŋ̩])
/ŋ/ is always post-vocalic
Place identified by neighboring formant transitions

Type      Nasal
Labial    /m/  m   me
Alveolar  /n/  n   knee
Velar     /ŋ/  ng  sing

Nasal Durations
[Figure: nasal duration (ms) for singleton nasals, nasals in unvoiced clusters, and nasals in voiced clusters]

Nasal consonants tend to be shorter in clusters with unvoiced
consonants, and longer with voiced consonants

Semivowel Production
Constriction in vocal tract, no turbulence
Slower articulatory motion than other consonants
Laterals form complete closure with tongue tip,
airflow via sides of constriction

[Figure: panels labeled [w], [y], [r], [l]]

Semivowels of American English


There are 4 semivowels in American English
Always attached to a vowel, though /l/ can form an entire syllable
in unstressed environments ([l̩])
Extreme articulation of a corresponding vowel
Similar formant positions
Generally weaker due to constriction

Type     Semivowel        Nearest Vowel
Glides   /w/  w  wet      /u/
         /y/  y  yet      /i/
Liquids  /r/  r  red      /ɝ/
         /l/  l  let      /o/

Acoustic Properties of Semivowels


/w/ is characterized by a very low F1, F2
Typically a rapid spectral falloff above F2
/y/ is characterized by very low F1, very high F2
/r/ is characterized by a very low F3
Prevocalic F3 < medial F3 < postvocalic F3
/l/ is characterized by a low F1 and F2
Often presence of high frequency energy
Postvocalic /l/ characterized by minimal spectral discontinuity,
gradual motion of formants

Aspirant Production

/h/ in American English
Turbulence excitation at glottis
No constriction in the vocal tract, normal formant excitation
Coupling with subglottal system results in little energy in F1 region
Periodic excitation can be present in medial position

Stop Production
Complete closure in the vocal tract, pressure build up
Sudden release of the constriction, turbulence noise
Can have periodic excitation during closure

[Figure: panels labeled [b], [d], [g]]

Stops of American English


There are 6 stop consonants in American English
Same places of articulation as nasal consonants
Unvoiced stops are typically aspirated
Voiced stops usually exhibit a voice-bar during closure
Information about formant transitions and release useful for
classification

Type      Voiced            Unvoiced
Labial    /b/  b  bee       /p/  p  pea
Alveolar  /d/  d  Dee       /t/  t  tea
Velar     /g/  g  geese     /k/  k  key

Singleton Stop Durations

[Figure: VOT duration (ms) for singleton stops; vertical axis from 0 to 80]

The voice onset time (VOT) of unvoiced stops
is longer than that of voiced stops

/s/-Stop Durations

[Figure: VOT duration (ms) for stops in /s/-stop sequences; vertical axis from 0 to 80]

Unvoiced stops are unaspirated in /s/-stop sequences

Stop-Semivowel Durations
[Figure: VOT duration (ms) for singleton stops and for stops in [Stop][Semivowel] clusters; vertical axis from 0 to 100]

Semivowels are partially devoiced in stop-semivowel sequences

Voicing Cues for Stops

There are many voicing cues for a stop



Affricate Production
Alveolar-stop palatal-fricative pairs
Sudden release of the constriction, turbulence noise
Can have periodic excitation during closure

Affricates of American English

There are two affricates in American English
Voiced:    /ǰ/  jh  judge
Unvoiced:  /č/  ch  church

Speech from a Close-Talking Microphone


[Figure: time-aligned panels over Time (seconds) from 0.0 to 1.9: Zero Crossing Rate, Total Energy (dB), Energy -- 125 Hz to 750 Hz (dB), Wide Band Spectrogram (kHz), and Waveform. File: /server/users/jwc/latex/sum97/sennheiser.wav, printed by jwc on Wed Jul 16 11:58:32 1997]

The Thinker is a famous sculpture
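The figure panels named Zero Crossing Rate, Total Energy, and Energy -- 125 Hz to 750 Hz are frame-level measurements. The sketch below shows one way such contours could be computed with NumPy. It is not the tool that produced the original plots: the function names, the 25 ms frame and 10 ms step, and the two synthetic test tones are assumptions for illustration.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Slice x into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    negative = np.signbit(frames)
    return np.mean(negative[:, 1:] != negative[:, :-1], axis=1)


def band_energy_db(frames, fs, lo_hz, hi_hz, floor=1e-12):
    """Energy in dB summed over FFT bins between lo_hz and hi_hz, per frame."""
    window = np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    in_band = (freqs >= lo_hz) & (freqs <= hi_hz)
    return 10.0 * np.log10(power[:, in_band].sum(axis=1) + floor)


# Synthetic stand-ins: a low tone (vowel-like) and a high tone (fricative-like).
fs = 16000
t = np.arange(fs // 2) / fs
low_tone = np.sin(2 * np.pi * 300.0 * t + 0.3)
high_tone = np.sin(2 * np.pi * 4000.0 * t + 0.3)

frame_len, hop = 400, 160               # 25 ms frames every 10 ms at 16 kHz
low_frames = frame_signal(low_tone, frame_len, hop)
high_frames = frame_signal(high_tone, frame_len, hop)

zcr_low = zero_crossing_rate(low_frames)
zcr_high = zero_crossing_rate(high_frames)
low_band_of_low = band_energy_db(low_frames, fs, 125.0, 750.0)
low_band_of_high = band_energy_db(high_frames, fs, 125.0, 750.0)
```

Multiplying the zero-crossing fraction by `fs` gives crossings per second. Sounds with mostly high-frequency energy show a high zero-crossing rate and little energy in the 125 Hz to 750 Hz band, while the low tone shows the opposite pattern.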



Speech from an Omni-Directional Microphone

[Figure: time-aligned panels over Time (seconds) from 0.0 to 1.9: Zero Crossing Rate, Total Energy (dB), Energy -- 125 Hz to 750 Hz (dB), Wide Band Spectrogram (kHz), and Waveform. File: /server/users/jwc/latex/sum97/bk.wav, printed by jwc on Wed Jul 16 11:57:43 1997]

The Thinker is a famous sculpture

Speech over a Telephone Channel


[Figure: time-aligned panels over Time (seconds) from 0.0 to 1.9: Zero Crossing Rate, Total Energy (dB), Energy -- 125 Hz to 750 Hz (dB), Wide Band Spectrogram (kHz), and Waveform. File: /server/users/jwc/latex/sum97/telephone.wav, printed by jwc on Wed Jul 16 11:59:12 1997]

The Thinker is a famous sculpture