Introduction
Definition
Overview
The voice portal market is growing rapidly, creating opportunities
for service providers to grow their business and revenues. Voice-based
Internet access uses rapidly advancing speech recognition technology to
give users anytime, anywhere communication and access, using the human
voice, over an office, wireless, or home phone. This paper describes
the technology factors that are making the voice portal the next big
opportunity on the web, as well as the approaches that service
providers and developers of voice portal solutions can follow to
make the most of this new market.
Voice User Interface www.bestneo.com
Why Voice?
One key next-generation network (NGN) service is the voice portal,
which provides callers with anywhere, anytime access to information
such as news, weather, stock quotes, and account balances using simple
voice commands and any telephone. Voice portals are poised to become
the next big thing in communications. The organizations using them will
have a very real edge in the marketplace: differentiating themselves
from their competitors, attracting loyal customers, and growing their
revenues.
Speaker Verification
The speaker-specific characteristics of speech are due to
differences in physiological and behavioral aspects of the speech
production system in humans. The main physiological aspect of the
human speech production system is the vocal tract shape. The vocal tract
is generally considered as the speech production organ above the vocal
folds, which consists of the following: (i) laryngeal pharynx (beneath the
epiglottis), (ii) oral pharynx (behind the tongue, between the epiglottis
and velum), (iii) oral cavity (forward of the velum and bounded by the
lips, tongue, and palate), (iv) nasal pharynx (above the velum, rear end
of nasal cavity), and (v) nasal cavity (above the palate and extending
from the pharynx to the nostrils). The shaded area in figure 1 depicts the
vocal tract.
Figure 1:
Figure 2:
The acoustic wave is produced when the airflow from the lungs
is carried by the trachea through the vocal folds. This source of
excitation can be characterized as phonation, whispering, frication,
compression, vibration, or a combination of these. Phonated excitation
occurs when the airflow is modulated by the vocal folds. Whispered
excitation is produced by airflow rushing through a small triangular
opening between the arytenoid cartilages at the rear of the nearly closed
vocal folds. Frication excitation is produced by constrictions in the vocal
tract. Compression excitation results from releasing a completely closed
and pressurized vocal tract. Vibration excitation is caused by air being
forced through a closure other than the vocal folds, especially at the
tongue. Speech produced by phonated excitation is called voiced, that
produced by phonated excitation plus frication is called mixed voiced,
and that produced by other types of excitation is called unvoiced.
Figure 3:
Choice of Features :
Linear predictive coding (LPC) features were very popular in early
speech-recognition and speaker-verification systems. However, comparing
two LPC feature vectors requires computationally expensive similarity
measures such as the Itakura-Saito distance, and hence LPC features are
unsuitable for use in real-time systems. Furui suggested the use of the
cepstrum, defined as the inverse Fourier transform of the logarithm of
the magnitude spectrum, in speech-recognition applications. With the
cepstrum, the similarity between two cepstral feature vectors can be
computed as a simple Euclidean distance. Furthermore, Atal has
demonstrated that the cepstrum derived from the LPC features gives the
best performance in terms of false acceptance rate (FAR) and false
rejection rate (FRR) for a speaker-verification system. Consequently,
we have decided to use the LPC-derived cepstrum for our
speaker-verification system.
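As an illustration of the LPC-derived cepstrum and the Euclidean
comparison described above, the sketch below converts a set of LPC
coefficients into cepstral coefficients with the standard recursion and
then measures the distance between two feature vectors. It is a minimal
sketch, not the implementation used in our system. It assumes the
all-pole model H(z) = 1 / (1 - sum of a_k z^-k) with the gain term
ignored; the function names lpc_to_cepstrum and euclidean_distance are
hypothetical.

```python
import math


def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients a_1..a_p into n_ceps cepstral coefficients.

    Assumption: the all-pole model is H(z) = 1 / (1 - sum_k a_k z^-k),
    so a[0] holds a_1. The gain-dependent coefficient c_0 is not returned.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] is left unused
    for n in range(1, n_ceps + 1):
        # Direct term a_n exists only while n does not exceed the LPC order.
        acc = a[n - 1] if n <= p else 0.0
        # Recursive term: sum over k of (k/n) * c_k * a_(n-k), with 1 <= n-k <= p.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Two hypothetical frames described by order-2 LPC coefficients.
frame_a = lpc_to_cepstrum([0.9, -0.2], n_ceps=8)
frame_b = lpc_to_cepstrum([0.8, -0.1], n_ceps=8)
distance = euclidean_distance(frame_a, frame_b)
```

The coefficient values above are made up for illustration only. The
point of the sketch is that, once frames are expressed as cepstral
vectors, comparing them needs only a sum of squared differences.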
Speaker Modeling :
Using cepstral analysis as described in the previous section, an
utterance may be represented as a sequence of feature vectors.
Utterances spoken by the same person but at different times result in
similar yet different sequences of feature vectors. The purpose of voice
modeling is to build a model that captures these variations in the
extracted set of features. There are two types of models that have been
used extensively in speaker verification and speech recognition systems:
stochastic models and template models. The stochastic model treats the
speech production process as a parametric random process and assumes
that the parameters of the underlying stochastic process can be estimated
in a precise, well-defined manner. The template model attempts to model
the speech production process in a non-parametric manner by retaining a
number of sequences of feature vectors derived from multiple utterances
of the same word by the same person. Template models dominated early
work in speaker verification and speech recognition because the template
model is intuitively more reasonable. However, recent work in stochastic
models has demonstrated that these models are more flexible and hence
allow for better modeling of the speech production process. A very
popular stochastic model for modeling the speech production process is
the Hidden Markov Model (HMM). HMMs are extensions to the
conventional Markov models, wherein the observations are a
probabilistic function of the state, i.e., the model is a doubly embedded
stochastic process where the underlying stochastic process is not directly
observable (it is hidden). The HMM can only be viewed through another
set of stochastic processes that produce the sequence of observations.
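The "doubly embedded" structure can be made concrete with a small
sketch that draws a sequence from a discrete HMM: one random process
moves between hidden states, and a second one emits an observation from
whichever state is current. This is an illustrative simplification. It
assumes a finite set of observation symbols, whereas cepstral feature
vectors are continuous, and the function name sample_hmm and all
probability values are hypothetical.

```python
import random


def sample_hmm(initial, transition, emission, length, rng):
    """Draw (hidden_states, observations) of the given length from a discrete HMM.

    initial[i]       : probability of starting in state i
    transition[i][j] : probability of moving from state i to state j
    emission[i][k]   : probability of emitting symbol k while in state i
    """
    def draw(probs):
        # Sample an index according to the given probability list.
        r = rng.random()
        cumulative = 0.0
        for index, p in enumerate(probs):
            cumulative += p
            if r < cumulative:
                return index
        return len(probs) - 1  # guard against floating-point rounding

    states, observations = [], []
    state = draw(initial)
    for _ in range(length):
        states.append(state)                        # hidden process
        observations.append(draw(emission[state]))  # observable process
        state = draw(transition[state])
    return states, observations


# Hypothetical two-state model with two observation symbols.
hidden, seen = sample_hmm(
    initial=[0.6, 0.4],
    transition=[[0.7, 0.3], [0.4, 0.6]],
    emission=[[0.9, 0.1], [0.2, 0.8]],
    length=10,
    rng=random.Random(0),
)
```

Only the list named seen would be available to a verifier; the list
named hidden is exactly the part of the process that cannot be observed
directly.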
Pattern Matching :
The pattern matching process involves comparing a given set of
input feature vectors against the speaker model for the claimed
identity and computing a matching score. For the hidden Markov models
discussed above, the matching score is the probability that the given
sequence of feature vectors was generated by the model.
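One common way to compute this probability is the forward algorithm,
sketched below for a discrete HMM. This is a minimal sketch under
stated assumptions: observations are symbol indices rather than
continuous cepstral vectors, the names forward_probability and
accept_claim are hypothetical, and the threshold is an arbitrary
illustrative value, not one taken from our system.

```python
def forward_probability(initial, transition, emission, observations):
    """Probability that the HMM generated the observation sequence.

    Uses the forward algorithm: alpha[s] is the joint probability of the
    observations seen so far and of being in state s at the current step.
    """
    if not observations:
        return 1.0  # the empty sequence is generated with certainty
    n_states = len(initial)
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[r] * transition[r][s] for r in range(n_states))
            * emission[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


def accept_claim(score, threshold):
    """Accept the claimed identity when the score reaches the threshold."""
    return score >= threshold


# Hypothetical speaker model and a short observation sequence.
score = forward_probability(
    initial=[0.6, 0.4],
    transition=[[0.7, 0.3], [0.4, 0.6]],
    emission=[[0.9, 0.1], [0.2, 0.8]],
    observations=[0, 1],
)
accepted = accept_claim(score, threshold=0.1)  # threshold is illustrative
```

For long utterances the raw probability becomes very small, so
practical implementations usually work with scaled or logarithmic
values. Raising the threshold lowers the false acceptance rate at the
cost of a higher false rejection rate, and lowering it does the
opposite.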
Conclusion
Speech-enabled Internet portals, or voice portals, are quickly
becoming the hottest trend in e-commerce, broadening access to
Internet content to everyone with the most universal communications
device of all, a telephone. Voice portals put all kinds of information at
a customer’s fingertips anytime, anywhere. Customers just dial the
voice portal’s prescribed number and use simple voice commands to
access whatever information they need. It’s quick, easy, and effective,
even from a car or the airport.
Today’s voice portals are just the tip of the iceberg: the first step in
changing the way people access Internet content and, ultimately, how
businesses and consumers will conduct business over the Internet. Over
the next few years, voice portals, and the core technologies behind
them, are poised to change how businesses view and interact with their
customers.