Acoustic processing
Features or characteristics of the speech signal associated
with different words must be extracted.
The processor must recognize the same utterance from a fast,
high-pitched speaker and a slow, low-pitched one.
Acoustic analysis includes investigation of:
Phonological variation
Likely sequences of sounds
Word and syllable boundaries
Markov processes
Markov Models represent sequential movement from state
to state.
They model a linear stream of events in which the event at
time t depends on previous events.
Hidden Markov Models (HMMs) represent a sequence of
sounds when intermediate states may not be known.
They are the most widely used models at the core of
acoustic analysis.
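A minimal sketch of how an HMM assigns a probability to a sequence of
observed sounds when the underlying states are not known. It uses the
standard forward algorithm, which sums over every possible path through
the hidden states. The state names, observation labels and all the
probabilities below are invented for illustration; they do not come from
any real recogniser.

```python
# Toy HMM. Hidden states stand in for sound units, observations for
# acoustic labels. Every name and number here is a made-up assumption.
states = ["s1", "s2"]
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {
    "s1": {"s1": 0.7, "s2": 0.3},
    "s2": {"s1": 0.4, "s2": 0.6},
}
emit_p = {
    "s1": {"a": 0.5, "b": 0.5},
    "s2": {"a": 0.1, "b": 0.9},
}


def forward(observations):
    """Return P(observations), summed over all hidden state paths."""
    # Probability of starting in each state and emitting the first symbol.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        # Move one step: arrive in state s from any previous state,
        # then emit the current observation from s.
        alpha = {
            s: emit_p[s][obs]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


likelihood = forward(["a", "b"])
```

A recogniser would typically compare such likelihoods across competing
word models, or use the closely related Viterbi algorithm to recover the
single most probable state path.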
Prosodic processing
Prosodic processing helps determine phrase and sentence
boundaries.
It recognises intonation and minor and major pauses.
These provide important information for interpreting a
stream of sounds.
In systems that include dialog management, it helps
determine the utterance type: statement or question.
Legal reports
Tennis commentary
Geological reports
TV chat show
The likelihood of each word will vary with the domain. Training data
must come from an appropriate source.
the I court
If "the high court" has occurred once or more in the training data
while the other candidates have not, there is a strong probability that
it is the correct one.
Improving predictability
Predictability can be improved by:
Reducing the number of words that can be used, as in some telephone
dialog systems
Getting unigram (single word) probabilities
Getting bigram and trigram probabilities
Suppose we had probabilities for bigrams starting with the word prime:
prime minister   0.98
prime cut        0.01
prime number     0.01
prime monkey     0.00
We have more predictability with this information than without it.
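Bigram probabilities such as those above are usually estimated by
counting. The sketch below uses the plain maximum-likelihood estimate,
P(w2 | w1) = count(w1 w2) / count(w1). The function name and the tiny
corpus are invented for illustration, and the resulting numbers are not
the ones in the table.

```python
from collections import Counter


def bigram_probs(tokens):
    """Estimate P(w2 | w1) from raw counts (maximum likelihood)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word only where it starts a bigram, so the
    # probabilities for a given first word sum to 1.
    first_counts = Counter(tokens[:-1])
    return {
        (w1, w2): count / first_counts[w1]
        for (w1, w2), count in pair_counts.items()
    }


# Made-up training text, far too small for real use.
corpus = (
    "the prime minister met the prime minister "
    "and ordered a prime cut"
).split()
probs = bigram_probs(corpus)

# Bigrams never seen in training are simply absent, i.e. probability 0.
p_monkey = probs.get(("prime", "monkey"), 0.0)
```

With this estimate, an unseen bigram such as "prime monkey" gets exactly
zero probability, which is the problem smoothing addresses.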
Methods of smoothing
Smoothing techniques are used so that events that are unseen in the
training data are not given zero probability.
Assume an unseen event has a probability related to that of an event
seen once. Share the probability of single events in the training data
with unseen events.
Back off from trigrams to bigrams, and from bigrams to unigrams,
since these are more likely to have occurred.
Use parts of speech instead of words: useful but computationally
expensive, so it slows down the system.
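The first two methods can be sketched roughly as follows. unseen_mass
applies the Good-Turing idea of reserving for unseen events the share of
the data taken up by events seen exactly once. backoff_score is a
deliberately simplified back-off: it applies a fixed penalty at each step
(the value 0.4 is an arbitrary assumption) and returns a score, not a
properly normalised probability as a full scheme such as Katz back-off
would. All function names and the toy corpus are invented for
illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, keyed by tuple."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def unseen_mass(counts):
    """Share of probability set aside for unseen events: N1 / N."""
    total = sum(counts.values())
    seen_once = sum(1 for c in counts.values() if c == 1)
    return seen_once / total


def backoff_score(words, tables, penalty=0.4):
    """Score the last word given its context, backing off when unseen."""
    factor = 1.0
    while len(words) > 1:
        n = len(words)
        context_count = tables[n - 1][words[:-1]]
        if tables[n][words] > 0 and context_count > 0:
            return factor * tables[n][words] / context_count
        # Unseen at this order: drop the oldest word and penalise.
        words = words[1:]
        factor *= penalty
    # Unigram level: relative frequency of the word itself.
    return factor * tables[1][words] / sum(tables[1].values())


# Made-up training text, far too small for real use.
tokens = (
    "the high court ruled and the high court met "
    "and the court rose"
).split()
tables = {n: ngram_counts(tokens, n) for n in (1, 2, 3)}

seen = backoff_score(("the", "high", "court"), tables)       # trigram seen
backed = backoff_score(("ruled", "the", "court"), tables)    # falls to bigram
reserved = unseen_mass(tables[1])
```

The trigram seen in training is scored directly, while the unseen one
falls back to the bigram "the court" and still receives a non-zero score.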
Applications (2)
The Subtitle project
Real-time subtitles for sports programmes on BBC2 are produced using
speech recognition technology.
The lecture capture project
Lecturers have their speech converted to text to help students with
impaired hearing.
Hypothesis to be investigated: some overseas students may understand
written English better than spoken English.
Verbmobil project
Speech recognition used as part of a larger application to translate spoken
German to Japanese. Limited progress in spite of huge resources.