
Automatic Speech Recognition

What is the task?

Getting a computer to understand spoken language
By "understand" we might mean:

-React appropriately
-Convert the input speech into another medium, e.g. text

How do humans do it?

Articulation produces sound waves, which the ear conveys to the brain for processing

How do computers do it?


Acoustic waveform

Acoustic signal

Digitization
Acoustic analysis of the speech signal
Phoneme dictionary
Language model

Speech recognition
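
The digitization and acoustic analysis steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the system behind these slides: the 16 kHz sample rate, the 25 ms frame length, the 10 ms hop, the synthetic sine-wave input, and the use of log energy as the only feature are all chosen for brevity. Practical recognizers usually compute richer features per frame.

```python
import math

def digitize(duration_s=0.1, sample_rate=16000, freq_hz=440.0):
    """Simulate digitization by sampling a sine tone (all values are assumptions)."""
    n = round(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

def frame_signal(samples, frame_len=400, hop=160):
    """Split the samples into overlapping fixed-size frames (25 ms / 10 ms at 16 kHz)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame):
    """A very simple per-frame acoustic feature: log of the summed squared amplitude."""
    return math.log(sum(x * x for x in frame) + 1e-10)

samples = digitize()
frames = frame_signal(samples)
features = [log_energy(f) for f in frames]
print(len(samples), len(frames), len(features))
```

Each entry of `features` stands in for one acoustic vector; the later stages (phoneme dictionary, language model) operate on such a sequence rather than on raw samples.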

Multilingual Architecture

Multilingual speakers already outnumber monolingual speakers.
The capacity to transparently recognize multiple
spoken languages is a desirable feature of ASR
systems.
e.g. OK Google, Siri

Multilingual Techniques

Universal Speech Model

Language Identification (LID) classifiers

Monolingual speech recognizers decode along with LID (confidence score)
Dynamic confidence score and LID decision
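
One way to picture running monolingual recognizers alongside an LID classifier is the sketch below. The slides do not say how the two scores are combined, so the linear interpolation, the `lid_weight` parameter, the `Hypothesis` type, and all example scores are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    language: str          # language of the monolingual recognizer that produced it
    text: str              # recognized text
    asr_confidence: float  # recognizer confidence in [0, 1]

def pick_hypothesis(hypotheses, lid_scores, lid_weight=0.5):
    """Pick the hypothesis with the best blend of LID score and recognizer confidence.

    Assumption: a simple weighted sum; real systems may combine scores differently.
    """
    def combined(h):
        lid = lid_scores.get(h.language, 0.0)
        return lid_weight * lid + (1.0 - lid_weight) * h.asr_confidence
    return max(hypotheses, key=combined)

hyps = [
    Hypothesis("en-US", "weather in paris", 0.82),
    Hypothesis("fr-FR", "météo à paris", 0.64),
]
lid = {"en-US": 0.7, "fr-FR": 0.3}  # made-up LID posteriors
best = pick_hypothesis(hyps, lid)
print(best.language, best.text)
```

Setting `lid_weight` to 1.0 trusts the LID classifier alone, while 0.0 relies only on recognizer confidence.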

ASR Multilingual Design

The end-to-end multilingual speech recognition system consists of the following components:
1. Client
2. Frontend
-Recognize
-Recognize+Search+Synthesis
-Multi-recognize+Search+Synthesis
3. Backend
-LID Backend
-Speech Recognizer Backend
-Web Search Backend
-Voice Synthesizer Backend


Multirecognizer Module


Representation of Speech & Speech Signal

Grammar & Syntax

-How the occurrence of words in sequence is governed

Lexicon or Dictionary

- How a word is supposed to be pronounced as a sequence of unitary sounds

Acoustic-phonetics

-How a unitary sound and/or a sequence of unitary sounds are supposed to be produced with the articulatory apparatus
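
The first two knowledge sources above can be made concrete with a toy example. The lexicon entries, phoneme symbols, and bigram probabilities below are invented for illustration; a bigram table is used here as one simple stand-in for a grammar, which the slides do not specify.

```python
# Hypothetical lexicon: each word maps to a sequence of unitary sounds (phonemes).
LEXICON = {
    "recognize": ["r", "eh", "k", "ax", "g", "n", "ay", "z"],
    "speech": ["s", "p", "iy", "ch"],
}

# Hypothetical bigram grammar: probability of a word given the previous word.
BIGRAM = {
    ("<s>", "recognize"): 0.5,
    ("recognize", "speech"): 0.4,
}

def pronounce(words):
    """Expand a word sequence into its phoneme sequence using the lexicon."""
    phones = []
    for w in words:
        phones.extend(LEXICON[w])
    return phones

def sequence_probability(words, unseen=1e-6):
    """Score a word sequence with the bigram table; unseen pairs get a small floor."""
    prob = 1.0
    prev = "<s>"
    for w in words:
        prob *= BIGRAM.get((prev, w), unseen)
        prev = w
    return prob

print(pronounce(["recognize", "speech"]))
print(sequence_probability(["recognize", "speech"]))
print(sequence_probability(["speech", "recognize"]))
```

The word order the grammar has seen scores far higher than the reversed order, even though both expand to valid phoneme sequences.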

THE HIDDEN MARKOV MODEL

The input audio waveform from a microphone is converted into a sequence of fixed-size acoustic vectors $Y_{1:T} = y_1, \ldots, y_T$ in a process called feature extraction [3]. The decoder then attempts to find the sequence of words $w_{1:L} = w_1, \ldots, w_L$ which is most likely to have generated $Y$, i.e. the decoder tries to find

$$\hat{w} = \arg\max_{w} \{ P(w \mid Y) \}.$$

However, since $P(w \mid Y)$ is difficult to model directly, Bayes' rule is used to transform the above equation into the equivalent problem of finding

$$\hat{w} = \arg\max_{w} \{ p(Y \mid w)\, P(w) \},$$

where the denominator $p(Y)$ is dropped because it does not depend on $w$.
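
The decoding rule above can be illustrated numerically. In this sketch the candidate word sequences, the acoustic log-likelihoods, and the language model probabilities are all made-up numbers; a real decoder searches a much larger hypothesis space rather than scoring an explicit list.

```python
import math

def decode(candidates, acoustic_loglik, lm_prob):
    """Return the candidate w maximizing log p(Y|w) + log P(w)."""
    return max(candidates,
               key=lambda w: acoustic_loglik[w] + math.log(lm_prob[w]))

candidates = ["recognize speech", "wreck a nice beach"]

# Hypothetical acoustic scores log p(Y|w): the two candidates sound almost alike.
acoustic_loglik = {"recognize speech": -42.0, "wreck a nice beach": -41.5}

# Hypothetical language model priors P(w).
lm_prob = {"recognize speech": 1e-4, "wreck a nice beach": 1e-7}

best = decode(candidates, acoustic_loglik, lm_prob)
print(best)
```

Working in log probabilities turns the product $p(Y \mid w)\, P(w)$ into a sum and avoids numerical underflow. Here the acoustic score slightly favors the second candidate, but the prior tips the decision to the first.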


Architecture of HMM-Based Recognizer


The overall HMM-based speech recognition system includes:

Feature Analysis

Unit Matching System

Lexical Decoding

Syntactic Analysis

Semantic Analysis


Phoneme and Topologies


Composite HMM for Viterbi Recognition (Pronunciation Dictionary)
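
As a rough illustration of Viterbi decoding over a composite HMM, the sketch below chains two phoneme states into a left-to-right model for a single word. The state names, the discrete observation symbols ("hiss", "tone"), and every probability are hypothetical; real recognizers use continuous acoustic vectors and multi-state phoneme topologies.

```python
import math

def log(p):
    """Log that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, states, start, trans, emit):
    """Return (best log probability, best state path) for an observation sequence."""
    best = {s: log(start[s]) + log(emit[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + log(trans[r][s]))
            best[s] = prev[p] + log(trans[p][s]) + log(emit[s][o])
            pointers[s] = p
        back.append(pointers)
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path

# Hypothetical composite HMM for one word: phoneme "s" followed by phoneme "iy".
states = ["s", "iy"]
start = {"s": 1.0, "iy": 0.0}
trans = {"s": {"s": 0.6, "iy": 0.4}, "iy": {"s": 0.0, "iy": 1.0}}
emit = {"s": {"hiss": 0.9, "tone": 0.1}, "iy": {"hiss": 0.2, "tone": 0.8}}

logp, path = viterbi(["hiss", "hiss", "tone", "tone"], states, start, trans, emit)
print(path, math.exp(logp))
```

The returned path is the alignment of observation frames to phoneme states; in a full system the pronunciation dictionary determines which phoneme models are concatenated for each word.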

