
Computational Audition at AFRL/HE: Past, Present, and Future

Dr. Timothy R. Anderson


Human Effectiveness Directorate
Air Force Research Laboratory
Biologically Based Signal Processing

• Research, development, and application of:
– Biologically based algorithms
– Perceptually relevant features
– Human-centered metrics and models
• Goal: to improve the robustness of speech processing systems
[Graphic: Speech Command & Control Technologies supporting the sensor-decision maker-shooter chain, with environments including AWACS, the future JAOC, JAOC Combat Plans, and the chem-bio defense environment]

2
Why Is This Area Important?

• Present signal processing systems (e.g., speech and speaker recognition, speech coding) are not robust in adverse military environments.
• Biological principles offer the potential to improve performance in those environments.

3
Biologically Based Signal Processing

[Chart: technology position (Dominant, Strong, Favorable, Tenable, Weak) versus technology maturity (Embryonic, Growth, Mature, Aging)]

Technical Challenges
• Identification and modeling of features and processes used by biological systems
• Incorporation of those key features and processes into computationally efficient algorithms and structures

Approach
• Develop psychoacoustic testing procedures
• Characterize key features and processes
• Develop human-centered models and metrics
• Implement computationally efficient algorithms
• Provide support to operational tests and warfighting exercises to evaluate system utility

4
Research Areas

• Cockpit Speech Recognition


• Robust Speech Recognition
– Monaural Speech Recognition
– Binaural Speech Recognition
– Auditory Model Front-ends
• Speaker Recognition/Verification
– Biologically Based Speaker ID
– Channel Robustness
– Speaker Recognizability Test
5
Phoneme Classification

• Kohonen Self-Organizing Feature Map (see the sketch below)
– 16 × 16 map
• 10-speaker database (TIMIT)
• 10 sentences per speaker
• Leave-one-out method (per speaker)
• Features calculated with
– 16 ms window
– 5 ms frame step

6
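
A minimal sketch of the classification setup described on this slide, assuming frame-level feature vectors and phoneme labels have already been computed with the 16 ms window and 5 ms step: a 16 × 16 Kohonen self-organizing map is trained on the frames, each map node is labeled with the majority phoneme of the training frames it wins, and a test frame is classified by its best-matching node. The training schedule and neighborhood parameters are illustrative choices, not values from the original work.

# Sketch of a 16 x 16 Kohonen self-organizing feature map for phoneme
# classification (frames: (N, D) feature array, labels: N phoneme labels).
import numpy as np

MAP_ROWS, MAP_COLS = 16, 16

def train_som(frames, epochs=20, lr0=0.5, sigma0=8.0, seed=0):
    """Train a 16x16 SOM on an (N, D) array of feature frames."""
    rng = np.random.default_rng(seed)
    n, d = frames.shape
    weights = rng.normal(size=(MAP_ROWS, MAP_COLS, d))
    # Grid coordinates used by the Gaussian neighborhood function.
    rows, cols = np.meshgrid(np.arange(MAP_ROWS), np.arange(MAP_COLS), indexing="ij")
    for epoch in range(epochs):
        lr = lr0 * (1.0 - epoch / epochs)            # decaying learning rate
        sigma = sigma0 * (1.0 - epoch / epochs) + 1.0  # shrinking neighborhood
        for x in frames[rng.permutation(n)]:
            # Best-matching unit: node whose weight vector is closest to x.
            dist = np.linalg.norm(weights - x, axis=2)
            br, bc = np.unravel_index(np.argmin(dist), dist.shape)
            # Pull the winner and its neighbors toward the input frame.
            h = np.exp(-((rows - br) ** 2 + (cols - bc) ** 2) / (2 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
    return weights

def label_nodes(weights, frames, labels):
    """Assign each node the majority phoneme label of the frames it wins."""
    votes = {}
    for x, lab in zip(frames, labels):
        dist = np.linalg.norm(weights - x, axis=2)
        node = np.unravel_index(np.argmin(dist), dist.shape)
        votes.setdefault(node, []).append(lab)
    return {node: max(set(v), key=v.count) for node, v in votes.items()}

def classify(weights, node_labels, frame):
    """Classify one feature frame by its best-matching labeled node."""
    dist = np.linalg.norm(weights - frame, axis=2)
    node = np.unravel_index(np.argmin(dist), dist.shape)
    return node_labels.get(node)

The leave-one-out evaluation on the slide would wrap this training and labeling in a loop, holding out one portion of each speaker's data at a time.
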
TRADITIONAL VS. AUDITORY
MONAURAL

[Chart: phoneme recognition rate (%) versus signal-to-noise ratio (1–32 dB and clean) for the AIM and MFCC front ends]

7
Binaural Speech Recognition

• Past
• Present
• Future

9
Binaural Speech Recognition

• Stereausis
• Cocktail Party Processor (see the interaural cross-correlation sketch below)
• BAIM
• BINAP

10
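
The binaural processors listed above are not specified in detail on these slides, but cocktail-party and coincidence-style processing generally relies on interaural comparisons. The sketch below shows generic per-channel interaural cross-correlation (finding the interaural lag that best aligns the left- and right-ear signals); it illustrates the underlying idea only, not the CPP or BINAP implementation.

# Generic interaural cross-correlation sketch. The left/right inputs are
# assumed to be outputs of an auditory filterbank, shaped (channels, samples).
import numpy as np

def interaural_lags(left, right, fs, max_itd_s=1e-3):
    """Return the best interaural lag (in samples) for each frequency channel."""
    max_lag = int(max_itd_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    best = np.zeros(left.shape[0], dtype=int)
    for ch in range(left.shape[0]):
        l, r = left[ch], right[ch]
        # Correlation between l[n] and r[n + k] for each candidate lag k.
        scores = [np.dot(l[max(0, -k):len(l) - max(0, k)],
                         r[max(0, k):len(r) - max(0, -k)]) for k in lags]
        best[ch] = lags[int(np.argmax(scores))]
    return best
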
EXPERIMENT SETUP

[Diagram: a speech source and a noise source placed at separate positions around the listener; the SNR-mixing sketch below shows one way such noisy test material can be prepared]
11
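
The recognition curves on the following slides sweep the signal-to-noise ratio from 1 dB to 32 dB plus a clean condition. The sketch below shows one conventional way to mix a noise recording into a speech signal at a target SNR; the original test-material preparation is not described on these slides, so this is only an assumed illustration.

# Mix noise into speech at a target SNR (dB). Both signals are 1-D numpy
# arrays at the same sample rate; the noise is tiled/truncated to match.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    noise = np.resize(noise, speech.shape)   # repeat or truncate to speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_speech / p_noise_scaled) equals snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise
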
MONAURAL VS. BINAURAL
COCKTAIL PARTY PROCESSOR

[Chart: phoneme recognition rate (%) versus SNR (1–32 dB and clean) for the Cocktail Party Processor (CPP) and the monaural baseline (MONO)]

12
MONAURAL VS. BINAURAL
AUDITORY IMAGE MODEL

[Chart: phoneme recognition rate (%) versus SNR (1–32 dB and clean) for the binaural (BAIM) and monaural (AIM) auditory image models]

13
BINAURAL

[Chart: phoneme recognition rate (%) versus SNR (1–32 dB and clean) for the two binaural systems, CPP and BAIM]

14
MONAURAL

[Chart: phoneme recognition rate (%) versus SNR (1–32 dB and clean) for AIM and the MONO baseline]

15
BAIM VS. CPP-AIM

[Chart: phoneme recognition rate (%) versus SNR (1–32 dB and clean) for BAIM, AIM, and CPP-AIM]

16
COINCIDENCE

[Chart: phoneme recognition rate (%) versus SNR (1–32 dB and clean) for BAIM and BINAP]

17
MONAURAL, BINAURAL AND
TRADITIONAL

[Chart: phoneme recognition rate (%) versus SNR (1–32 dB and clean) for CPP, BAIM, AIM, MONO, MFCC, BINAP, and CPP-AIM]

18
Binaural Speech Recognition

RESULTS: The binaural auditory model provides a better representation than traditional techniques.

TASK                  SPEECH            RESULTS
Phoneme recognition   Low to high SNR   7–12 dB binaural advantage

19
Binaural Speech Recognition

• Past
• Present
– No Current Work
• Future

20
Binaural Speech Recognition

• Past
• Present
• Future
– Implement binaural ASR system
– Investigate further binaural fusion mechanisms
– Meeting room data
– Implement binaural system using AIM chips

21
Auditory Model Front Ends

• Past
• Present
• Future

22
Auditory Model Front Ends

• Tanner Research “Analog Speech Recognition”
– Implementation of AIM
– 56-channel analog filter bank
– Single SBus board
– 1.5× real-time

23
Auditory Model Front Ends

• AFIT
– Designed a digital implementation (see the software sketch below)
• Middle ear, BMM, adaptive thresholding
– 32 channels per chip
– 300 Hz – 7 kHz
– 44.1 kHz sampling rate
– 2 chips provide 64 channels in real time

24
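
A rough software analogue of such a multichannel auditory front end: a 64-channel filterbank spanning 300 Hz to 7 kHz at a 44.1 kHz sampling rate, followed by half-wave rectification as a crude stand-in for the adaptive thresholding stage. The gammatone filters and ERB spacing used here are common auditory-modeling choices, not necessarily what the AFIT chip or AIM hardware implements.

# Software sketch of a 64-channel auditory filterbank front end.
import numpy as np

FS = 44100            # sampling rate from the slide
N_CHANNELS = 64       # two 32-channel chips
F_LO, F_HI = 300.0, 7000.0

def erb(f):
    """Equivalent rectangular bandwidth (Glasberg & Moore approximation)."""
    return 24.7 + 0.108 * f

def erb_spaced_cfs(n=N_CHANNELS, lo=F_LO, hi=F_HI):
    """Center frequencies spaced evenly on the ERB-rate scale."""
    def erb_rate(f):
        return 21.4 * np.log10(4.37e-3 * f + 1.0)
    e = np.linspace(erb_rate(lo), erb_rate(hi), n)
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

def gammatone_ir(cf, fs=FS, dur=0.032, order=4):
    """Finite impulse response of a gammatone filter centered at cf."""
    t = np.arange(int(dur * fs)) / fs
    ir = (t ** (order - 1) * np.exp(-2 * np.pi * 1.019 * erb(cf) * t)
          * np.cos(2 * np.pi * cf * t))
    return ir / np.max(np.abs(ir))

def auditory_front_end(signal, fs=FS):
    """Return a (channels, samples) array of half-wave-rectified channel outputs."""
    channels = [np.maximum(np.convolve(signal, gammatone_ir(cf, fs), mode="same"), 0.0)
                for cf in erb_spaced_cfs()]
    return np.array(channels)
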
Auditory Model Front Ends

• Past
• Present
– Single-board (USB) system designed and prototyped
– Current chip design undergoing debug
– Second fabrication run this fall
• Future

27
Auditory Model Front Ends

• Past
• Present
• Future
– Debug and verify chip fabrication
– Debug PC-based real-time auditory model front end
– Implement complete end-to-end auditory ASR
– Investigate feedback mechanisms in auditory model
for ASR

28
Biologically Based SID

• Past
• Present
• Future

29
Biologically Based SID

• Auditory models investigated
– Payton’s Auditory Model (PAM)
– Auditory Image Model (AIM)
• VQ codebook used to model each speaker (see the sketch below)
• 37 speakers from TIMIT (dialect regions 1–2; 12 female, 25 male)
• Identification results:
– MFCC 94%
– PAM 67%
– AIM 91%

30
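
A minimal sketch of VQ-codebook speaker identification as described above: one k-means codebook is trained per enrolled speaker, and a test utterance is assigned to the speaker whose codebook yields the lowest average quantization distortion on its frames. The codebook size and the choice of feature frames (MFCC, PAM, or AIM) are left as assumptions.

# VQ-codebook speaker ID sketch; feature extraction is assumed to happen upstream.
import numpy as np
from scipy.cluster.vq import kmeans, vq

CODEBOOK_SIZE = 64   # assumed; the slide does not give the codebook size

def train_codebooks(train_frames):
    """train_frames: dict speaker_id -> (N, D) array of feature frames."""
    return {spk: kmeans(frames.astype(float), CODEBOOK_SIZE)[0]
            for spk, frames in train_frames.items()}

def identify(codebooks, test_frames):
    """Return the speaker whose codebook quantizes the test frames best."""
    def avg_distortion(codebook):
        _, dists = vq(test_frames.astype(float), codebook)
        return float(np.mean(dists))
    return min(codebooks, key=lambda spk: avg_distortion(codebooks[spk]))
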
Biologically Based SID

• Past
• Present
• Future

31
Biologically Based SID

• Using perceptual features
– Formants, formant bandwidths, and pitch
• Voiced frames
• Using a GMM classifier (see the sketch below)
• Conducting experiments on larger databases
– Switchboard

32
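
A sketch of the GMM classification step: one Gaussian mixture model per speaker is trained on voiced-frame feature vectors (here assumed to be the nine formant, bandwidth, and pitch features mentioned later in the deck), and a test utterance is assigned to the speaker whose model gives the highest average log-likelihood. The mixture size and the use of scikit-learn are illustrative assumptions.

# GMM speaker-ID sketch over voiced-frame perceptual features.
from sklearn.mixture import GaussianMixture

N_COMPONENTS = 16   # assumed mixture size; not given on the slide

def train_models(train_features):
    """train_features: dict speaker_id -> (N, 9) array of voiced-frame features."""
    models = {}
    for spk, feats in train_features.items():
        gmm = GaussianMixture(n_components=N_COMPONENTS, covariance_type="diag")
        models[spk] = gmm.fit(feats)
    return models

def identify(models, test_features):
    """Pick the speaker whose model gives the highest average log-likelihood."""
    return max(models, key=lambda spk: models[spk].score(test_features))
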
Biologically Based SID

[Slides 33–36: speaker ID results figures comparing three feature sets: MFCCs (no deltas, no CMS), MFCCs (no CMS), and the F0 baseline]
Biologically Based SID

• Performance isn’t the best, but this feature set…


– Uses only 9 features versus 19–38 for MFCCs
– Hasn’t been as heavily researched as MFCCs

37
Biologically Based SID

• Determine reasons for performance differences between various databases
• Channel & score normalizations
• Pitch-synchronous features
• Closed-phase analysis
• Glottal model features

38
Biologically Based SID

39
Biologically Based SID

• Past
• Present
• Future

40
Biologically Based SID

• Investigate other auditory-based features


– Vocal agitation
– Formants, formant bandwidths, and pitch calculated
from the auditory model
– Auditory model features
• Conduct experiments on other databases
– Broadcast news
– Military training exercises

41
Speaker Recognizability Test

• Past
• Present
• Future

42
Speaker Recognizability Test

• Dynastat, “The Development of a Method for Evaluating and Predicting Speaker Recognizability in Voice Communication Systems”
– Determined perceptually relevant features
• Perceptual voice traits (PVTs)
• 21 traits currently identified
– Developed a methodology to measure these traits
• Human listeners
– Developed a measure of the recognizability loss due to the channel
• Diagnostic Speaker Recognizability Test (DSRT)

43
Speaker Recognizability Test

• Past
• Present
• Future

44
Speaker Recognizability Test

• Use perceptual voice traits to identify groups of similar and distinctive speakers
• Determine whether current SID systems have difficulty with these similar speakers
• Implementing an in-house web-based listening test for
– PVT rating
– DSRT

45
Speaker Recognizability Test

• Past
• Present
• Future

46
Speaker Recognizability Test

• Obtain PVT ratings for a larger database
– Switchboard
• Determine acoustic correlates of perceptually relevant features
• Use them as features for speaker recognition
• Utilize the DSRT for communication system testing

47
Summary

• Computational audition offers the potential for improved performance in adverse military environments
• Much research still needs to be accomplished
– Fidelity of the models
– Model feedback pathways
• Computational cost is no longer a limiting factor in performing meaningful experiments

48
Questions?

49
