
Easy Does It: Robust Spectro-Temporal Many-Stream ASR without Fine Tuning Streams

Ravuri, Morgan, UC Berkeley


Presented by JJ

Motivation
Physiological experiments in different mammal species: a large percentage of neurons in the primary auditory cortex (A1) respond differently to upward- versus downward-moving ripples in the spectrogram of the input (Depireux et al., 2001).

Spectro-temporal receptive fields (STRFs): individual neurons are sensitive to specific spectro-temporal modulation frequencies in the incoming sound signal.

Introduction
Cortically inspired time-frequency (TF) features capture spectral and temporal modulations relevant to speech recognition and discrimination. Spectro-temporal features are derived by filtering spectrograms with particular filters. In this case, Gabor filters are applied to the auditory spectrogram.

Gabor Filters
A Gabor filter is the product of a Gaussian envelope and a complex sinusoid s(n, k).
[Figures: example 1D and 2D Gabor filters with the envelope and sinusoid labeled.]
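The envelope-times-sinusoid construction can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' filter bank: the function names, the parameter names (`omega_k`, `omega_n`, `sigma_k`, `sigma_n`), and the choice to filter with the real part only are assumptions made for the example.

```python
import numpy as np


def gabor_2d(size_k, size_n, omega_k, omega_n, sigma_k, sigma_n):
    """Build a complex 2D Gabor filter of shape (size_k, size_n).

    k indexes frequency channels and n indexes time frames.
    omega_k and omega_n are modulation frequencies in radians per sample;
    sigma_k and sigma_n are the Gaussian widths (all hypothetical names).
    """
    k = np.arange(size_k) - (size_k - 1) / 2.0
    n = np.arange(size_n) - (size_n - 1) / 2.0
    kk, nn = np.meshgrid(k, n, indexing="ij")
    # Gaussian envelope centered on the middle of the filter.
    envelope = np.exp(-0.5 * ((kk / sigma_k) ** 2 + (nn / sigma_n) ** 2))
    # Complex sinusoid s(n, k) oscillating along both axes.
    sinusoid = np.exp(1j * (omega_k * kk + omega_n * nn))
    return envelope * sinusoid


def filter_spectrogram(spec, gabor):
    """Convolve a (channels, frames) spectrogram with the filter's real part.

    Zero-padded FFT convolution, cropped to the input size.
    """
    kernel = gabor.real
    full_shape = (spec.shape[0] + kernel.shape[0] - 1,
                  spec.shape[1] + kernel.shape[1] - 1)
    product = np.fft.rfft2(spec, s=full_shape) * np.fft.rfft2(kernel, s=full_shape)
    full = np.fft.irfft2(product, s=full_shape)
    r0 = (kernel.shape[0] - 1) // 2
    c0 = (kernel.shape[1] - 1) // 2
    return full[r0:r0 + spec.shape[0], c0:c0 + spec.shape[1]]
```

Each choice of modulation frequencies and widths gives one filter, and therefore one candidate feature stream.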

Their Gabor Filters
[Equation slide: the authors' filter definition, annotated with the labels "dummy", "parameters", and "indices".]
Tons of combinations!

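System
[Block diagram: several feature streams, each with its own MLP; the MLP outputs are merged, passed through PCA, and combined with MFCCs to form the output features.]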


System
MLP (Multilayer Perceptron): the structure of the MLP depends on the type of feature and the corpus.

Feature    Input units  Frames of context  Hidden units                        Output units
Spectral   567          9                  160 for Aurora2, 500 for Numbers95  56
Cepstral   351          9                  160 for Aurora2, 500 for Numbers95  56

[Block diagram labels: each stream's MLP output is 56D; the merged outputs pass through PCA (32D) before being combined with the MFCCs (output labeled 45D on this slide).]
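The layer sizes above can be made concrete with a small forward-pass sketch for a spectral stream on Aurora2 (567 inputs, 160 hidden units, 56 outputs). Several details here are assumptions, not taken from the slides: 63 features per frame is simply 567 divided by 9 frames of context, the hidden layer uses a sigmoid, the output layer uses a softmax, and the weights are random rather than trained.

```python
import numpy as np


def stack_context(frames, context=9):
    """Concatenate `context` consecutive frames around each frame.

    Edges are padded by repeating the first and last frame.
    """
    half = context // 2
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(context)])


def mlp_posteriors(x, w1, b1, w2, b2):
    """One-hidden-layer MLP: sigmoid hidden units, softmax outputs (assumed)."""
    hidden = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))
    logits = hidden @ w2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 63))       # 100 frames, 63 features (assumed)
x = stack_context(frames, context=9)          # shape (100, 567)
w1 = rng.standard_normal((567, 160)) * 0.01   # untrained placeholder weights
b1 = np.zeros(160)
w2 = rng.standard_normal((160, 56)) * 0.01
b2 = np.zeros(56)
post = mlp_posteriors(x, w1, b1, w2, b2)      # shape (100, 56), rows sum to 1
```

For Numbers95 the hidden layer would have 500 units instead of 160; for a cepstral stream the input would be 351 wide.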

System
The outputs of each stream's MLP provide an estimate of the posterior probability distribution over phones. These phone probability estimates are then combined across streams by inverse entropy weighting.
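A minimal sketch of inverse entropy combination follows, assuming the basic form in which each stream's weight at each frame is proportional to one over the entropy of its posterior distribution. The function name and the `eps` smoothing constant are hypothetical, and the authors may use a variant (for example with entropy thresholds) that is not shown on the slides.

```python
import numpy as np


def inverse_entropy_merge(stream_posteriors, eps=1e-12):
    """Merge per-stream phone posteriors with inverse entropy weights.

    stream_posteriors has shape (n_streams, n_frames, n_phones).
    Returns an array of shape (n_frames, n_phones).
    """
    p = np.asarray(stream_posteriors, dtype=float)
    # Per-stream, per-frame entropy: low entropy means a confident stream.
    entropy = -np.sum(p * np.log(p + eps), axis=2)
    inverse = 1.0 / (entropy + eps)
    # Normalize so the weights across streams sum to 1 at every frame.
    weights = inverse / inverse.sum(axis=0, keepdims=True)
    return np.sum(weights[:, :, None] * p, axis=0)


confident = np.array([[0.98, 0.01, 0.01]])
uncertain = np.array([[1 / 3, 1 / 3, 1 / 3]])
merged = inverse_entropy_merge(np.stack([confident, uncertain]))
```

In this toy case the confident stream dominates the merged estimate, which is the intended effect: streams that are unsure about a frame contribute less to it.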


System
Then apply the KL transform (principal components analysis, PCA) to the log probabilities of the merged MLP outputs.

System
After the KL transform, the features are reduced to 32D and orthogonalized. They are then mean- and variance-normalized per utterance and finally appended to the MFCC features.
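The post-processing steps above can be sketched as follows: take logs of the merged posteriors, project onto the top 32 principal components, normalize per utterance, and append to the 39D MFCCs. The function names are hypothetical, and where the transform is estimated is an assumption (typically it would be fit on training data; here it is fit on the same random data purely for illustration).

```python
import numpy as np


def fit_klt(log_post, n_components=32):
    """Estimate a KL transform (PCA basis) from log posteriors (frames x dims)."""
    mean = log_post.mean(axis=0)
    cov = np.cov(log_post - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest ones
    return mean, eigvecs[:, order]


def tandem_features(merged_post, mfcc, mean, basis, eps=1e-10):
    """Project, normalize over the utterance, and append to the MFCCs."""
    log_post = np.log(merged_post + eps)
    projected = (log_post - mean) @ basis
    # Per-utterance mean and variance normalization of the projected features.
    projected = (projected - projected.mean(axis=0)) / (projected.std(axis=0) + eps)
    return np.hstack([mfcc, projected])


rng = np.random.default_rng(0)
merged_post = rng.dirichlet(np.ones(56), size=200)   # stand-in for merged MLP outputs
mfcc = rng.standard_normal((200, 39))                # stand-in for 39D MFCCs
mean, basis = fit_klt(np.log(merged_post + 1e-10), n_components=32)
feats = tandem_features(merged_post, mfcc, mean, basis)  # shape (200, 71)
```

The 71D result is the 39D MFCC vector followed by the 32D transformed MLP features.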

System
[Block diagram: each stream's MLP output is 56D; the merged MLP outputs are reduced to 32D by PCA and appended to the 39D MFCCs, giving 71D output features for the HMM.]

Experiments
Database
Aurora 2 (0 to 20 dB)
Numbers95
  Consists of various numeric portions extracted from telephone dialogues
  Vocabulary size of 32 words
  Training set contains 3590 utterances of clean data, totaling roughly 3 hours
  2 test sets containing 1227 utterances
    The first contains only clean data
    The second contains the same utterances with noise added at five SNRs (20 dB, 15 dB, 10 dB, 5 dB, and 0 dB)
Additive noise

Baseline: 39D MFCC
4-stream system
28-stream system
Uni-modulation system: 150 streams, spectral only and spectral/cepstral

Metric: Word Error Rate (WER)
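For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the recognizer output, divided by the number of reference words. The sketch below is a generic implementation for a single utterance, not the scoring tool used in the experiments; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return WER for one utterance given whitespace-separated word strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("one two three four", "one too three")` counts one substitution and one deletion against four reference words, giving 0.5.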

Results
[WER results for Aurora 2 and Numbers95 were presented as tables or figures that are not recoverable from this transcript. The slides step through three highlighted points: Discussion 1, Discussion 2, and Discussion 3.]

Future Work
Not just additive noise
Another TF feature might not work
Log-mel filterbank? Or power like PNCC?
How to combine MLP outputs? Inverse entropy?
