
Speech Recognition

Automated Speech Recognition, ASR

1. The acoustic processor takes in a signal and returns a set of
candidate words.
2. The language model ranks the candidate words in order of
likelihood.

Acoustic processing
Features or characteristics of the speech signal associated
with different words must be extracted.
The processor must recognize the same utterance from a fast,
high-pitched speaker and a slow, low-pitched one.
Acoustic analysis includes investigation of
Phonological variation
Likely sequences of sounds
Word and syllable boundaries

Markov processes
Markov Models represent sequential movement from state
to state.
They model a linear stream of events, when an event at
time t depends on previous events.
Hidden Markov Models (HMMs) represent a sequence of
sounds when intermediate states may not be known.
They are the most widely used models at the core of
acoustic analysis.
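
A common way to recover the most likely hidden state sequence in an HMM is the Viterbi algorithm. The slides do not name it, so the sketch below is an illustration under stated assumptions rather than the method of any particular recogniser. The states ("s", "z"), the observation labels ("hiss", "buzz") and every probability are invented for the example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[s] holds the probability and path of the best sequence ending in s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Choose the predecessor state that maximises the path probability.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s] * emit_p[s][obs], best[prev][1])
                 for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Hypothetical toy model: two hidden sounds and two kinds of acoustic frame.
states = ["s", "z"]
start_p = {"s": 0.6, "z": 0.4}
trans_p = {"s": {"s": 0.7, "z": 0.3}, "z": {"s": 0.4, "z": 0.6}}
emit_p = {"s": {"hiss": 0.8, "buzz": 0.2}, "z": {"hiss": 0.3, "buzz": 0.7}}

prob, path = viterbi(["hiss", "hiss", "buzz"], states, start_p, trans_p, emit_p)
print(path, prob)
```

The hidden states are never observed directly; only the frames are, which is why the best path has to be inferred.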

Prosodic processing
Prosodic processing helps determine phrase and sentence boundaries.
It recognises intonation and minor and major pauses.
These provide important information for interpreting a
stream of sounds.
In systems that include dialog management, it helps
determine the utterance type: statement or question.

Language Models (1)

Two probabilities are combined in the speech recogniser:
(i) the probability that a certain speech signal S would be
produced by a given word - from the acoustic processor
(ii) the probability that the word is w, given its context - from
the language model (LM).
The language model helps determine the most likely word, which is not
necessarily the one at the top of the acoustic ranking.
Note the different uses of the term model. Many models
encapsulate a theory. Here, as in Claws, the LM is a set of statistics
drawn from training data.
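
As a minimal sketch of how the two probabilities might be combined, the example below multiplies an acoustic score P(S | w) by a language model score P(w | context) and picks the word with the highest product. The function name `pick_word`, the candidate words and all the numbers are hypothetical; real systems usually work with log probabilities and weighted scores.

```python
def pick_word(acoustic, language):
    """Combine acoustic and language-model probabilities for each candidate."""
    # score(w) = P(S | w) * P(w | context); words unknown to the LM score 0.
    scores = {w: acoustic[w] * language.get(w, 0.0) for w in acoustic}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical scores for two similar-sounding candidates after "the high ...".
acoustic = {"court": 0.40, "caught": 0.45}   # acoustic processor output
language = {"court": 0.30, "caught": 0.01}   # language model, given context

best, scores = pick_word(acoustic, language)
print(best, scores)
```

Here the acoustic ranking alone would prefer "caught", but the combined score prefers "court".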

Language Models (2)

Suppose the acoustic processor produces as candidate words
caught s(noozing)
These are homophones. The acoustic signals may be very similar.
Information from the training corpus is needed to help predict the right
word.
Suppose the language domain is one of:

Legal reports

Tennis commentary

Geological reports

TV chat show
The likelihood of each word will vary with the domain. Training data
must come from an appropriate source.

Language Models (3)

Basic concept: word frequencies and word patterns that occur in the
training data are likely to show up again. Information that helps get the
right word includes:
Frequency of words occurring in the training corpus.
Frequency of bigrams and trigrams occurring in the training corpus.
A bigram is a word pair, such as tennis courts. A trigram is a word triple.
In the sentence "The tennis courts were too wet to play." we have the
trigrams:
The tennis courts, tennis courts were, courts were too,
were too wet, too wet to, wet to play, to play .
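
The bigrams and trigrams of a sentence can be extracted with a sliding window. This is a minimal sketch that assumes simple whitespace tokenisation with the final full stop treated as its own token; the helper name `ngrams` is an illustrative choice.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "The tennis courts were too wet to play .".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

for trigram in trigrams:
    print(" ".join(trigram))
```

A sentence of N tokens yields N - 1 bigrams and N - 2 trigrams.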

Language Models (4)

Suppose the acoustic processor produces the candidate trigrams

the eye court

the I court

the high court

If the high court has occurred once or more in the training data
while the others have not, there is a strong probability that this is
the correct one.
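
The choice among candidate trigrams can be sketched as a lookup in a table of trigram counts. The training sentence below is invented for illustration, and raw counts stand in for the probabilities a real language model would use.

```python
from collections import Counter


def trigram_counts(tokens):
    """Count every trigram in a list of training tokens."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


# Hypothetical training data from a legal domain.
training = "the high court ruled that the lower court erred".split()
counts = trigram_counts(training)

candidates = [
    ("the", "eye", "court"),
    ("the", "I", "court"),
    ("the", "high", "court"),
]
# A Counter returns 0 for trigrams that never occurred in training.
best = max(candidates, key=lambda trigram: counts[trigram])
print(best, counts[best])
```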

Improving predictability
Predictability can be improved by
Reducing the number of words that can be used, as in some telephone
dialog systems
Getting unigram (single word) probabilities
Getting bigram and trigram probabilities
Suppose we had probabilities for bigrams starting with the word prime
prime minister 0.98
prime cut
prime number
prime monkey 0.00
We have more predictability with this information than without it.
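
Bigram probabilities of this kind are typically estimated as relative frequencies: the count of the pair divided by the count of all pairs that start with the same first word. The sketch below assumes that estimate; the toy corpus and the function name `bigram_probs` are hypothetical and do not reproduce the figures above.

```python
from collections import Counter


def bigram_probs(tokens, first):
    """Estimate P(second | first) from bigram counts in the training tokens."""
    pairs = Counter(zip(tokens, tokens[1:]))
    total = sum(c for (w1, _), c in pairs.items() if w1 == first)
    return {w2: c / total for (w1, w2), c in pairs.items() if w1 == first}


# Hypothetical training text.
tokens = "the prime minister met the prime minister to discuss a prime number".split()
probs = bigram_probs(tokens, "prime")

print(probs)
print(probs.get("monkey", 0.0))  # unseen continuation
```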

Perplexity metrics to evaluate the language model

Informally speaking, perplexity is a measure of predictability. The lower
the perplexity, the higher the predictability of a sequence of words, the
better the language model.
Having a quantifiable metric has contributed to improvements in ASR.
High perplexity is associated with a large number of word choices, low
perplexity with fewer choices.
Perplexity is related to entropy, which, informally speaking, is a measure
of the degree of disorder, or unpredictability, of a sequence.
If entropy is represented by the symbol H and perplexity by P, then
P = 2^{H}
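
As a minimal sketch, perplexity can be computed from the probabilities a language model assigns to each word of a test sequence. It assumes the usual per-word entropy estimate, H = -(1/N) * sum of log2 p(w_i), with perplexity equal to 2 raised to the power H; the function name and the input values are illustrative.

```python
import math


def perplexity(word_probs):
    """Perplexity of a sequence, given the model's probability for each word."""
    # Average negative log2 probability per word (entropy estimate, in bits).
    entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** entropy


# A model choosing uniformly among 4 words at every step has perplexity 4.
uniform_four = perplexity([0.25, 0.25, 0.25, 0.25])
# A more confident model has lower perplexity.
confident = perplexity([0.5, 0.5, 0.5, 0.5])
print(uniform_four, confident)
```

This matches the informal reading: perplexity behaves like the effective number of word choices at each step.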

Zipfian distribution of words

The sparse data problem
When we compare new, unseen texts with training texts we find that
words, bigrams and trigrams occur that were not in the training data.
Consider the Zipfian distribution of words:
A small number of function words are common but most of the ordinary
content words occur rarely.
This is an empirical observation. For example, in the Brown corpus of 1
million words from mixed sources, 40% of the distinct words occur only once.
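
The share of words that occur only once can be measured directly from word counts. The sketch below uses a tiny invented text, so its numbers say nothing about the Brown corpus; the function name `hapax_share` is an illustrative choice.

```python
from collections import Counter


def hapax_share(tokens):
    """Fraction of distinct words that occur exactly once."""
    counts = Counter(tokens)
    once = sum(1 for c in counts.values() if c == 1)
    return once / len(counts)


tokens = "the cat sat on the mat and the dog sat".split()
share = hapax_share(tokens)

# Most frequent words first: a few common words, then a long tail.
print(Counter(tokens).most_common())
print(share)
```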

Zipfian distribution and sparse data (2)

This is more pronounced for bigrams and trigrams.
In a corpus of 39 million words from news articles from the Wall Street
Journal, a limited domain, most single words are covered.
But 77% of trigrams in a new article will, on average, not have been seen
in the training data.

Methods of smoothing
Smoothing techniques are used so that events that are unseen in the
training data are not given zero probability.
Assume an unseen event has a probability related to that of an event
seen once. Share the probability of single events in the training data
with unseen events.
Back off from trigrams to bigrams, and from bigrams to unigrams,
since these are more likely to have occurred.
Use parts of speech instead of words: useful but computationally
expensive, so it slows down the system.
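
The back-off idea can be sketched as follows. This is a simplified score in the style of "stupid backoff", not Katz back-off or Good-Turing discounting: it uses the trigram relative frequency if the trigram was seen, otherwise it falls back to the bigram and then the unigram, multiplying by a fixed factor at each step. The factor 0.4, the add-one unigram estimate and the toy corpus are assumptions, and the resulting scores are not normalised probabilities.

```python
from collections import Counter


def train(tokens):
    """Count unigrams, bigrams and trigrams, keyed by n-gram length."""
    return {
        1: Counter((w,) for w in tokens),
        2: Counter(zip(tokens, tokens[1:])),
        3: Counter(zip(tokens, tokens[1:], tokens[2:])),
    }


def backoff_score(ngram, counts, total, alpha=0.4):
    """Score the last word of `ngram` given the words before it."""
    if len(ngram) == 1:
        # Add-one estimate so that unseen words never score zero;
        # the extra 1 in the vocabulary size stands for "unknown word".
        vocab = len(counts[1]) + 1
        return (counts[1][ngram] + 1) / (total + vocab)
    if counts[len(ngram)][ngram] > 0:
        context = ngram[:-1]
        return counts[len(ngram)][ngram] / counts[len(context)][context]
    # Unseen n-gram: back off to the shorter n-gram, with a penalty.
    return alpha * backoff_score(ngram[1:], counts, total, alpha)


# Hypothetical training text.
tokens = "the high court ruled and the high court agreed".split()
counts = train(tokens)
total = len(tokens)

seen = backoff_score(("the", "high", "court"), counts, total)
unseen = backoff_score(("the", "eye", "court"), counts, total)
print(seen, unseen)
```

The unseen trigram gets a small but non-zero score, which is the purpose of smoothing.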

Improving the language model

In practice, perplexity declines as

The size of the training corpus increases.
Corpora of hundreds of millions of words are commonly used in
commercial products.
Information from bigrams and trigrams as well as single words is
included in the language model. Current ASR systems are based on
trigram language models.
The training corpus includes data from the same language domain
as test data.

Speech recognition applications

Use as stand-alone systems on ordinary PCs.
Use by professionals who have to produce many reports: lawyers,
surveyors, etc.
Use in radiography, where the user can examine an image and speak
findings without turning away.
Afterwards the user will have to edit the report to correct any errors.
Performance is typically 90-97% words correct.
Continuous speech is used.
Recognizers are trained by each user to customise them for that voice.
To reach the 90% level, about 10 minutes of training is needed with some
products. Further training improves performance.

Applications (2)
The Subtitle project
Real-time subtitles for sports programmes on BBC2 are produced using
speech recognition technology.
The lecture capture project
Lecturers have their speech converted to text to help students with
impaired hearing.
Hypothesis to be investigated: some overseas students may understand
written English better than spoken English.
Verbmobil project
Speech recognition is used as part of a larger application to translate spoken
German to Japanese. Progress has been limited in spite of huge resources.