
Methods to Objectively Evaluate Speech Quality
Author: Richard Jobson
President
Teraquant Corporation

Abstract
This paper gives an overview of methods to objectively evaluate the perceptual quality of speech. The science of
perceptual speech quality measurement has been progressing for the past two decades. The current
state of the art for full-reference measurements uses the third iteration of the standard. Good-quality
implementations of the algorithms contained within the standards can deliver a 97%
correlation with the subjective test database, depending on the implementation of the test setup.
Commercial test solutions exist to measure speech from acoustic, electronic analog and digital interfaces.
Mean Opinion Score [MOS] is a scale used in voice telecommunications to predict the perceived quality of a
speech sample. MOS describes speech clarity or intelligibility. It does not measure delay across a
network or echo. The scale runs from 1 to 5, where a score of 1 is bad and 5 is excellent. MOS
test sessions comprise 15 to 25 people listening to speech files of good quality and of poor quality with
impairments and scoring them subjectively. This subjective test process is specified in the standard ITU-T P.800.

Mean opinion score (MOS)

MOS   Quality     Impairment
5     Excellent   Imperceptible
4     Good        Perceptible but not annoying
3     Fair        Slightly annoying
2     Poor        Annoying
1     Bad         Very annoying
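
As a minimal sketch of how a subjective MOS is formed from a listening panel, the snippet below averages individual 1 to 5 ratings for one speech sample. The function name and the panel ratings are hypothetical and only illustrate the arithmetic; ITU-T P.800 specifies the full test procedure, which this does not attempt to reproduce.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a panel's ratings for one speech sample on the 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1-5 scale")
    return mean(ratings)


# Hypothetical ratings from a 16-listener panel (illustrative values only).
panel = [4, 4, 3, 5, 4, 3, 4, 4, 5, 3, 4, 4, 3, 4, 4, 4]
print(f"MOS = {mean_opinion_score(panel):.2f}")
```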

MOS is valuable for characterizing any device or network where voice is compressed and transmitted. Such
devices may be handsets or mobile phones, and such networks may be packet-based VoIP networks or
wireless networks. Networks employing state-of-the-art codecs are optimized for the compression of voice and
treat tones such as DTMF tones differently from voice. Therefore, to characterize the quality of the
network or the System under Test, real speech files need to be transmitted. Frequency response, levels, echo
and delay must still be measured, but in addition to the perceived speech quality.
Subjective testing is time-consuming and expensive, so algorithms have been developed that allow a
computer to compare the clean speech file injected into the system with the received, degraded file and
calculate a MOS with a high degree of correlation to the subjective MOS. The first such objective speech quality
measurement was standardized in 1998 with the PSQM algorithm, P.861: Objective quality measurement of
telephone-band (300-3400 Hz) speech codecs. This was followed by ITU-T Recommendation P.862: Perceptual
evaluation of speech quality (PESQ), which has gained widespread worldwide usage as a reliable method for
characterizing most narrowband telephony systems.
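
The "degree of correlation" between objective and subjective scores can be illustrated with a plain Pearson correlation coefficient. This is only a sketch: the score pairs below are hypothetical, the function name is an assumption, and real validations of PSQM, PESQ or POLQA use large subjective databases and may apply additional mapping steps not shown here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-file scores (illustrative values, not measured data).
subjective_mos = [4.1, 3.6, 2.9, 2.2, 1.5]
objective_mos = [4.0, 3.7, 3.0, 2.0, 1.6]
print(f"correlation = {pearson(subjective_mos, objective_mos):.3f}")
```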
The following are examples of Mean Opinion Scores for one implementation of different codecs:

Codec            Data rate [kbit/s]   Mean opinion score (MOS)
G.711 (ISDN)     64                   4.1
iLBC             15.2                 4.14
AMR              12.2                 4.14
G.729            8                    3.92
G.723.1          6.3                  3.9
GSM EFR          12.2                 3.8
G.726 ADPCM      32                   3.85
G.729a           8                    3.7
G.723.1          5.3                  3.65
G.728            16                   3.61
GSM FR           12.2                 3.5
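
One way to read the codec figures above is as a trade-off between data rate and MOS. The sketch below holds the same codec, data rate and MOS values in a list and picks the highest-scoring codec that fits a given bit-rate budget. The helper name `best_codec_within` is hypothetical, and because the scores describe one implementation only, the result is illustrative rather than a general ranking.

```python
# (codec, data rate in kbit/s, MOS) values as listed in the text above.
CODECS = [
    ("G.711 (ISDN)", 64.0, 4.10),
    ("iLBC", 15.2, 4.14),
    ("AMR", 12.2, 4.14),
    ("G.729", 8.0, 3.92),
    ("G.723.1", 6.3, 3.90),
    ("GSM EFR", 12.2, 3.80),
    ("G.726 ADPCM", 32.0, 3.85),
    ("G.729a", 8.0, 3.70),
    ("G.723.1", 5.3, 3.65),
    ("G.728", 16.0, 3.61),
    ("GSM FR", 12.2, 3.50),
]


def best_codec_within(max_kbps, codecs=CODECS):
    """Return the highest-MOS row whose data rate fits the budget, or None."""
    candidates = [row for row in codecs if row[1] <= max_kbps]
    if not candidates:
        return None
    # Highest MOS wins; on a MOS tie, prefer the lower data rate.
    return max(candidates, key=lambda row: (row[2], -row[1]))


for budget in (8, 13, 64):
    print(budget, "kbit/s budget ->", best_codec_within(budget))
```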

Effects of Compression

[Figure: effects of compression, with panels for G.711 at 64 kbit/s, G.723 at 6.3 kbit/s, G.729 at 8 kbit/s and G.723 at 5.3 kbit/s]

Wideband telephony networks are expected to improve the user experience, including the intelligibility of voice
conversations, compared with the highly compressed codecs used in both packet and wireless networks. Hopefully the tired
phrase "can you hear me now" will become a less frequent part of our vocabulary.
High Definition or Wideband Telephony is just now coming into common usage. G.722 is the Wideband Telephony
codec for VoIP, and AMR-WB is now being tested for wireless networks. These codecs still need to be tested in order to tune
codec implementations, packet loss concealment algorithms and performance in areas of poor coverage or high
congestion.

Here are the analog frequency definitions for the different forms of telecommunications:

50 Hz to 3.8 kHz   Narrowband (NB)
50 Hz to 7.5 kHz   Wideband (WB)
50 Hz to 14 kHz    Super-wideband (SWB)
20 Hz to 24 kHz    Full-band (FB)

[Figure: NB, WB, SWB and FB bandwidths shown on a frequency axis from 100 Hz to 30 kHz]
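
The band definitions above can be sketched as a small lookup that names the narrowest band able to carry a given upper audio frequency. The function name is hypothetical, and for simplicity the sketch looks only at the upper band edge and ignores the lower edge (50 Hz or 20 Hz).

```python
# Upper band edges in Hz, taken from the definitions above.
BAND_UPPER_EDGES_HZ = [
    ("Narrowband", 3_800),
    ("Wideband", 7_500),
    ("Super-wideband", 14_000),
    ("Full-band", 24_000),
]


def classify_band(upper_edge_hz):
    """Return the narrowest band covering upper_edge_hz, or None if none does."""
    if upper_edge_hz <= 0:
        raise ValueError("frequency must be positive")
    for name, limit_hz in BAND_UPPER_EDGES_HZ:
        if upper_edge_hz <= limit_hz:
            return name
    return None  # above the full-band limit


for freq in (3_400, 7_000, 14_000, 20_000):
    print(freq, "Hz ->", classify_band(freq))
```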

PESQ was never designed to address wideband networks. In addition, vendors of time-warping codecs
[e.g. EVRC], as well as Skype and iLBC, were not satisfied that PESQ accurately measured the full quality of their
codecs.

PESQ Shortcomings

iLBC: subjective score exceeds the PESQ score by 0.0-0.4
iSAC: subjective score exceeds the PESQ score by 0.0-0.4

In 2006 the ITU-T commenced work on a new standard to address the limitations of PESQ, and during 2011 the
POLQA standard, Recommendation P.863, was published.
The new POLQA speech quality metric is used for objective prediction of MOS. The older PESQ algorithm
(ITU-T P.862) has been used for narrowband telephony since it was approved in 2000. PESQ was not designed
for Wideband Telephony and also did not represent well the speech quality of time-warping codecs. POLQA
addresses all these shortcomings and provides a scale that extends all the way to 24 kHz audio.
It is desirable to use the same scale so that laboratories can compare new results for wideband telephony with
their old PESQ database. However, the question of human expectation comes into play, because all these
objective measurements performed by computers must correlate with, or predict, subjective experience. If you watch
a video on your smartphone, you might consider the picture quality to be good. Your expectations are set in
the context of the small screen and the convenience of the video being played on a handheld device. If you
were shown the same video on your brand-new, expensive high-definition 1080p TV, you would be very
disappointed, even if the pixel resolution had been scaled to the 62-inch screen size. Your expectation of
quality is tempered by the format in which you are viewing it.
The same applies to speech and audio. If you were to participate in a MOS test and were invited into a studio with
high-fidelity speakers and orchestral classical music playing, and were asked to rate the quality of the High
Definition speech you were about to hear, your expectations would be set high and you would be more critical. You
would score the audio lower than if you had been asked to rate the speech quality of your most recent cellular
phone call.

POLQA offers two scales, the narrowband scale and the super wideband scale. Super wideband telephony
reaches 14 kHz analog audio frequency. The narrowband-focused scale maps directly onto the old PESQ scale and
exploits the higher scores not given by test participants in narrowband tests.

NB: Maximum MOS value 4.25
WB: Maximum MOS value 4.5
SWB: Maximum MOS value 4.75

So a score of 4.5 on the narrowband POLQA scale is experimentally the best value you will ever obtain with
wideband telephony equipment. For POLQA, the maximum MOS value in tests is 4.75.
In future years, the industry will migrate exclusively to the super wideband POLQA scale, as soon as users
always expect high-definition or hi-fi quality from communications audio.
Condition               POLQA SWB   POLQA NB
14 kHz 16-bit linear    4.75        -
7 kHz 16-bit linear     4.5         -
AMR-WB                  -           -
3.4 kHz 16-bit linear   3.8         4.5
G.711                   3.7         4.3
EFR/AMR-FR 12.2 kbps    3.6         4.1
EVRC 9.5 kbps           3.4         3.9
EVRC-B 9.5 kbps         3.5         -
AMR-HR 7.95 kbps        3.4         3.8

Applications for Perceptual Speech Quality Measurements


Perceptual speech quality measurements are used to make end-to-end measurements on any network where
voice codecs are used to compress the speech, or where speech transmission systems or networks may
introduce impairments such as weak radio signals, multipath, packet loss or packet jitter.
Examples of systems where Perceptual Speech Quality Measurements are valuable:

Codec evaluation
Frame or packet concealment implementation
Headsets combining the digitization of sound
VoIP phone assessment, both softphone and dedicated VoIP phones
Mobile handsets
Digitizing radio & Intercom systems
VoIP networks
Wireless cellular networks
Speech enhancement and noise reduction systems
Transcoders

A new important feature added to POLQA is its ability to measure the improvement of speech quality for speech
enhancement and noise reduction systems.

[Figure: iLBC codec measuring 4.21 on the Narrowband POLQA scale]

In over two decades of such tests, no statistically significant number of participants has ever
scored any speech recording as excellent, or 5.0. The highest score typically obtained in any test was 4.54.
So this measurement of 4.21 for iLBC is a good score for the codec.
For more information on making PESQ and POLQA measurements, be sure to contact only renowned and
well-respected test vendors, because the science of speech quality measurement requires expertise and experience
in many different areas: audio, analog electronics and computing. It is easy to make a measurement, but
care is required to ensure that the measurement is accurate, correlates with human subjective experience and is put
into the context of the environment, resolution, format etc.

One of the most admired vendors for speech quality metrics is Malden Electronics, available in the USA through
Teraquant Corporation, www.teraquant.com
More information can be obtained at:
http://en.wikipedia.org/wiki/POLQA
