
Speech Intelligibility

In terms of individual communication, speech is probably the most important and efficient means,
even in today's multi-media society. Experimental tests by Chapanis show that, as would be
expected, the performance time of cooperative tasks performed in groups was up to ten times faster
when speech was allowed compared to when it was not. Thus, with many rooms being used solely
for speech between individuals and groups, it is important that acoustic designs accommodate and
enhance such use.

Measuring Intelligibility
The intelligibility of speech refers to the accuracy with which a normal listener can understand a
spoken word or phrase. Given the fact that some of the information communicated through speech
is contained within contextual, visual and gestural cues, it is still possible to understand meaning
even if only a fraction of the discrete speech units are heard correctly. However, in large auditoria
and places where reproduced speech is used, the listener has limited access to these cues and
must rely more heavily upon the sound actually produced by the mouth.

Research into this area began with the development of telephone and telecommunication systems
in the early part of this century. A product of this research was a quantitative measure for
intelligibility based on articulation testing. This procedure (as described by Lochner and Burger)
normally consists of an announcer reading out lists of syllables, words or sentences to one or more
listeners within the test enclosure. The percentage of these correctly recorded by the listeners is
called the articulation score and is then taken as an 'in-situ' measure of the speech intelligibility of
that enclosure.

The science of articulation testing was substantially refined at Bell Telephone Laboratories and later
at the Psycho-Acoustic Laboratory at Harvard University. From this later work, a set of phonetically
balanced, monosyllabic test lists was prepared, called the Harvard P.B.50 word score. In order to
negate any influence of non-phonetic cues on the measured intelligibility, these word lists consist
only of meaningless or jumbled syllables. Thus, in order to be correctly recorded by the listener,
each consonant and vowel sound must be clearly audible. As a further measure, many tests are
conducted with the syllables embedded in a carrier phrase in an attempt to simulate fluent speech.
There are now many derivations of this methodology (such as the Fairbanks rhyme method used
by Bradley and Latham). In each case, however, the resulting value is a percentage score of
correctly recorded syllables, and the degree of intelligibility is considered to correlate with the
average of these scores. This percentage becomes the measured speech intelligibility rating for
that particular enclosure.

As stated before, normal connected speech can be understood even if some of the syllables are
unintelligible. This is due to the fact that the listener can deduce the meaning from the context of the
sentence. However, even under perfect conditions, the maximum word score normally attainable is
about 95% due to unavoidable errors. A word score of 80% enables the audience to understand
every sentence without undue effort. In a room where the word score is closer to 70%, the listener has
to concentrate to understand what is said whilst below 60% the intelligibility is quite poor.

Predicting Intelligibility
There are several available methods of predicting speech intelligibility within an enclosure. These
include the articulation index (AI), the speech interference level (SIL), the A-weighted
signal-to-noise ratio (SNA), useful/detrimental sound ratios (U80 and U95) and the speech
transmission index (STI). Each of these methods is based on the same fundamental principle:
determining a ratio between the received speech signal and the level of interfering noise. It is this
basic signal-to-noise relationship upon which speech intelligibility is deemed to depend; the higher
the ratio, the greater the intelligibility.

For the purpose of these lectures, speech is considered only as that recognisable vocal information
necessary for the correct interpretation of specific speech units. All other sound energy reaching the
listener is considered to be interfering noise. Thus a number of signal-to-noise ratios may be
measured from the same signal but at different frequency bands. There are basically three
measurable factors which influence these signal-to-noise ratios:

The level and manner of the speech output,
The level and spectrum of background noise, and
The nature and duration of the room response.

Speech Level
In a disturbance-free environment, normal speech levels fall between 55 and 65 dB (measured at a
distance of 1m from the speaker). In specific situations, levels may reach as high as 96 dB when
shouting a warning, or as low as 30 dB when whispering softly. It should be noted that speech
levels vary significantly between individuals, even in similar acoustic conditions. Thus, when making
predictions of speech intelligibility, it is often wise to base them on worst-case values rather than the
average. This can be done by tempering the average value by an amount representing one
standard deviation of inter-individual speech levels. Even this does not fully account for the variation
at extremes of effort. As can easily be shown (Pearsons et al), up to 15% of women cannot raise
their voice above 75 dB whilst 15% of shouting men can easily exceed 96 dB, sometimes reaching
as high as 104 dB. The following table presents 'worst-case' values for vocal effort.

Vocal Effort        dB(A)     Vocal Effort    dB(A)
Whispering          32        Raised          57
Soft                37        Loud            62
Relaxed             42        Very Loud       67
Normal (private)    47        Shouting        72
Normal (public)     52        Max. Shout      77

Table of average vocal effort and sound level.

When calculating intelligibility, it has been shown that loud or shouted speech is more difficult to
understand, regardless of the level at the listener's ear. This is due mainly to changes in phonetics
and intonation, which become noticeable above 75 dB. Additionally, if, for more normal speech, the
level arriving at the ear is very high (greater than 80 dB), there is research to indicate an
overloading of the ear. This may occur as a result of a very short speaker-listener distance and
results in a further decrease in intelligibility.

For the purposes of simplified calculation, these phenomena may be summarised in the following
manner. Firstly, for loud speech, for every 10 dB rise in output level above 75 dB (as measured 1m
from the speaker), the signal-to-noise ratio between received speech and interfering noise should
be reduced by 4 dB. Secondly, for normal speech, it is assumed that output levels between 45 and
75 dB and received levels below 80 dB have no such noticeable effect. The effects of extreme
proximity work out to an approximate 3-5% reduction in intelligibility for every 10 dB above 80 dB,
equivalent to less than a 1 dB reduction in the signal-to-noise ratio (assuming moderately low
levels of interfering noise).
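
The loud-speech rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and applying the 4 dB per 10 dB penalty proportionally to partial steps (rather than in whole 10 dB increments) is an assumption, not something the text specifies.

```python
def loud_speech_snr_penalty(output_level_db: float) -> float:
    """Return the SNR reduction (dB) for speech louder than 75 dB at 1 m.

    Assumption: the 4 dB per 10 dB rule is interpolated linearly.
    """
    excess_db = max(0.0, output_level_db - 75.0)
    return 4.0 * excess_db / 10.0


def effective_snr(snr_db: float, output_level_db: float) -> float:
    """Apply the loud-speech penalty to a signal-to-noise ratio (dB)."""
    return snr_db - loud_speech_snr_penalty(output_level_db)


if __name__ == "__main__":
    # Shouted speech at 95 dB (20 dB above the 75 dB threshold).
    print(effective_snr(10.0, 95.0))
```

Output levels of 75 dB or less incur no penalty here, matching the assumption in the text that normal speech shows no such effect.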

Background Noise
Within every acoustic environment, there is always a certain level of ambient background noise
present. The level of this is mostly dependent on the activities taking place within the space and its
more immediate surrounds. The most obvious effect of background noise is that it masks the
speech signal, reducing the signal-to-noise ratio and forcing the listener to concentrate harder
on the speech. There is, however, another effect of background noise.

As is obvious, short duration speech may entail significant vocal effort whereas conversations of a
longer duration require a much lower level in order that it be comfortably sustained by the speaker.
This, however, explains only one aspect of a speaker's choice of speech level.

One of the most important determinants of speech level is what is termed the Lombard effect. This
effect (originally noted by Lombard and studied further by Lane and Tranel) is most clearly
illustrated when a person is asked to speak whilst listening to music through headphones. In
general, a speaker checks his
vocal effort using feedback from his own hearing and the exertion of muscles participating in the
speech process. With headphones obscuring the ears and the music masking much of the
feedback, the voice is almost automatically raised to compensate. Lane and Tranel showed that
background and other interfering noise have much the same effect, as do both temporary and
permanent hearing loss.

Quantifying this effect is difficult as the individual response is often complex. For example, in normal
conversation, rather than raising their voice, participants are more likely to move closer together.
However, a rule-of-thumb relationship (as suggested by Lazarus) is that every 1 dB increase in
interfering noise above 45 dB will result in an average rise in output speech level of 0.5 to 0.6 dB.
This automatic rise does not normally occur at softer speech levels as these are more likely to
mean individual or face-to-face conversations. In these situations, the clear preference is almost
always to move closer together meaning that this effect is only applied to speech levels of 55 dB or
higher.
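
As a rough illustration, the rule of thumb above might be coded as follows. The function name and the default slope of 0.55 dB per dB (the midpoint of the quoted 0.5 to 0.6 range) are assumptions made for this sketch.

```python
def lombard_speech_level(quiet_level_db: float,
                         noise_level_db: float,
                         slope: float = 0.55) -> float:
    """Estimate the raised speech level (dB) in the presence of noise.

    quiet_level_db: speech output level without interfering noise.
    noise_level_db: level of the interfering noise.
    slope: dB of speech rise per dB of noise above 45 dB (assumed 0.55).
    """
    # The effect is only applied to speech levels of 55 dB or higher.
    if quiet_level_db < 55.0:
        return quiet_level_db
    noise_excess_db = max(0.0, noise_level_db - 45.0)
    return quiet_level_db + slope * noise_excess_db


if __name__ == "__main__":
    # A 60 dB talker in 65 dB noise, using the lower bound of the range.
    print(lombard_speech_level(60.0, 65.0, slope=0.5))
```

Speech below 55 dB is returned unchanged, reflecting the preference for moving closer together rather than raising the voice.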

Room Response
The nature of the room response can significantly affect speech intelligibility. This influence,
whether beneficial or detrimental, is a function of the impulse response. In general, the enclosure
will enhance the perception of speech when the amount of energy reaching the listener within the
speech integration period (35-50 msec) is relatively high. Given that ambient background noise is
constantly being reflected about the enclosure, any additional early speech reflections will
effectively increase the apparent signal-to-noise ratio. However, late arriving reflections and
excessive reverberation actually contribute to the apparent background noise level by interfering
with the direct speech signal. Thus, too much late sound energy will tend to reduce the apparent
signal-to-noise ratio.

Measures of Speech Intelligibility

The A-weighted Signal/Noise Ratio (SNA)

This is probably the simplest and easiest to apply of all the methods proposed. It is simply the
difference between the A-weighted long-term average speech level (LSA) and the A-weighted
long-term average level of background noise (LNA), both measured over the same period of time.

SNA = LSA - LNA

Measurements of speech intelligibility against this ratio made by Bradley suggest a levelling off or
plateauing effect above +15 dB. This agrees with other research, where signal-to-noise ratios
greater than +15 dB showed only negligible additional improvement in intelligibility, thus indicating
+15 dB as a measure to be aimed for.
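
A minimal sketch of this check in Python, assuming both levels are already available in dB(A). The function names and the example levels are illustrative only.

```python
def a_weighted_snr(speech_level_dba: float, noise_level_dba: float) -> float:
    """SNA = LSA - LNA, in dB."""
    return speech_level_dba - noise_level_dba


def meets_snr_target(sna_db: float, target_db: float = 15.0) -> bool:
    """True when the ratio reaches the +15 dB plateau suggested above."""
    return sna_db >= target_db


if __name__ == "__main__":
    # Example: 62 dB(A) speech against 45 dB(A) background noise.
    sna = a_weighted_snr(62.0, 45.0)
    print(sna, meets_snr_target(sna))
```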

The Articulation Index (AI)

This value is basically a linear measure ranging from 0.0 to 1.0, based on calculations of the
signal-to-noise ratios in five octave bands (with centre frequencies of 0.25, 0.5, 1, 2 and 4 kHz).
It is possible to obtain a more accurate calculation based upon 1/3rd octave band sound pressure
levels (based on work by Kryter); however, this requires more detailed knowledge of both the
speech and noise spectra. Since the speech level usually refers to the long-term value for normal
speakers, octave spectra are normally sufficient for simple calculations.

Calculation of the AI consists of three basic steps:

Measuring the effective signal-to-noise ratio for each octave band.
Applying a weighting factor to each ratio and clipping so that the maximum contribution occurs at +18 dB and the minimum at -12 dB.
Calculating the average value.

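Thus the articulation index can be calculated from:

$$ AI = \sum_{i=1}^{5} G[i] \, \frac{SNR[i] + 12}{30} $$

where G[i] represents the weighting factor for each octave band and SNR[i] is the signal-to-noise
ratio (in dB) in that band, clipped to the range -12 dB to +18 dB.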

Frequency (Hz)    Weighting Factor (G[i])
250               0.072
500               0.144
1000              0.222
2000              0.327
4000              0.234
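
The three steps can be sketched in Python using the weighting factors from the table. Mapping the clipped range of -12 dB to +18 dB linearly onto 0 to 1 by way of (SNR + 12) / 30 is an assumption consistent with those clipping limits. The function and variable names are illustrative.

```python
# Octave-band weighting factors G[i] from the table above (Hz -> weight).
AI_WEIGHTS = {250: 0.072, 500: 0.144, 1000: 0.222, 2000: 0.327, 4000: 0.234}


def articulation_index(snr_by_band: dict) -> float:
    """Weighted sum of clipped, normalised octave-band SNRs (dB)."""
    total = 0.0
    for freq_hz, weight in AI_WEIGHTS.items():
        # Clip so contributions saturate at +18 dB and vanish at -12 dB.
        snr_db = min(18.0, max(-12.0, snr_by_band[freq_hz]))
        # Assumed linear mapping of the 30 dB range onto 0..1.
        total += weight * (snr_db + 12.0) / 30.0
    return total


if __name__ == "__main__":
    example = {250: 10.0, 500: 8.0, 1000: 5.0, 2000: 2.0, 4000: 0.0}
    print(round(articulation_index(example), 3))
```

Because the tabulated weights sum to 0.999 rather than exactly 1, the index computed this way tops out at 0.999.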

Speech Transmission Index (STI)

First introduced as a measure of speech intelligibility by Houtgast and Steeneken, the derivation of
the speech transmission index is basically a much more detailed version of the articulation index.
One of the more important improvements is that an attempt is made to include distortions in the
time domain.

As discussed earlier, these distortions result from reverberation and delayed reflections. Their
principal effect is a tendency to smooth out fast fluctuations in the intensity of a speech signal, with
a secondary effect of increasing the effective interference level of background noise. Thus, a
criterion for the transmission of speech was established where speech was regarded as a flow of
sound energy with temporal variations in intensity and spectrum. The degree to which these
fluctuations were preserved by the transmission system was therefore considered a measure of its
faithfulness.

Steeneken and Houtgast argued that the preservation of the temporal envelope of a speech signal
implied the preservation of its individual sinusoidal components. This reasoning, they suggested,
leads to the determination of a room's acoustic merits by the extent to which sinusoidal intensity
modulations, produced by the speaker, are still present at the receiver.

Thus, the long term average speech level is replaced by a theoretical test signal with a spectrum
similar to normal speech. The intensity of this signal is then modulated by a function with a
modulation index of 1. In this way, any degradation of the signal will appear as a reduction in the
modulation index derived from the received signal.

Given that, for any average enclosure, the rate of reverberant decay is relatively stable, its effect on
slower modulations of the signal will be different from faster modulations. A study of normal speech
showed that modulation frequencies between 1 and 8 Hz are strongly represented, peaking at 3 Hz.
Steeneken and Houtgast therefore suggest a faithful transmission should represent 0.4 to 20 Hz in
order to ensure excellent intelligibility of both very slow and very fast speech.

If F represents the modulation frequency (ranging from 0.4 to 20 Hz in 18 x 1/3rd octave steps),
and t is the time in seconds, then the intensity of the test signal is modulated by the following
function: 1 + cos(2πFt).

Relating this to the impulse model, the following relationship is given for the modulation index.

$$ m(F) = \frac{\left| \sum_{n} a_n \, e^{-j 2 \pi F r_n / c} \right|}{\sum_{n} a_n} $$

where c is the speed of sound and a_n refers to the respective attenuation factor (the relative
intensity of the nth impulse), with r_n being the relative path length of the nth impulse. The impulse
response is then applied to this formula 18 times, once for each value of F. This set of 18
modulation indexes can then be referred to as representing the Modulation Transfer Function
(MTF) of an enclosure. The current calculation, however, does not consider the effects of
background noise. This can be included by modifying the calculated modulation index by a ratio of
the intensity of the background noise (I_n) to the average intensity of the received signal (written
here as I_s), measured at the listener:

$$ m'(F) = m(F) \cdot \frac{1}{1 + I_n / I_s} $$

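Within this one function, the effects of reverberation and the direct field are all represented, whilst
allowing an easy translation into an apparent signal-to-noise ratio:

$$ (S/N)_{app}(F) = 10 \log_{10} \frac{m'(F)}{1 - m'(F)} $$

where m'(F) is the noise-corrected modulation index at modulation frequency F.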

After clipping each individual ratio to a value between +15 and -15 dB, an average apparent overall
signal-to-noise ratio is obtained from the 18 equally weighted values. The STI thus represents a
linearised ratio in the following form:

$$ STI = \frac{\overline{(S/N)}_{app} + 15}{30} $$

In a manner similar to the calculation of the AI, the apparent signal-to-noise ratios may be derived
for many different frequency bands in order to produce a comprehensive data matrix. Earlier
methodologies suggested a subdivision of the audio scale into 7 octave bands, centred at 0.125,
0.25, 0.5, 1, 2, 4, and 8 kHz. Corresponding to the model in use here, however, the derived STI
value is based only on the single frequency band currently of interest. Given that the STI is a
derived average, consecutive values at different frequencies within the same enclosure may simply
be averaged to provide an overall index.

By using empirical measurements obtained from actual test signals, the STI of any existing room
can easily be measured. This forms the basis of the portable RASTI system used in the real-time
determination of speech intelligibility.
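
The single-band STI calculation described in this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the impulse response is modelled as a list of (relative intensity, relative path length) pairs, the 18 modulation frequencies are generated as exact 1/3rd octave steps from 0.4 Hz (so the last is about 20.3 Hz rather than the nominal 20 Hz), the speed of sound is taken as 343 m/s, and all function names are hypothetical.

```python
import cmath
import math


def modulation_frequencies() -> list:
    """18 modulation frequencies in 1/3rd octave steps from 0.4 Hz."""
    return [0.4 * 2 ** (k / 3) for k in range(18)]


def modulation_index(impulses, f_hz, noise_ratio=0.0, c=343.0):
    """Modulation index at f_hz for discrete impulses.

    impulses: list of (relative_intensity, relative_path_length_m).
    noise_ratio: background noise intensity / received signal intensity.
    """
    total = sum(a for a, _ in impulses)
    phasor = sum(a * cmath.exp(-2j * math.pi * f_hz * r / c)
                 for a, r in impulses)
    return (abs(phasor) / total) / (1.0 + noise_ratio)


def speech_transmission_index(impulses, noise_ratio=0.0, c=343.0):
    """Average the clipped apparent SNRs and map -15..+15 dB onto 0..1."""
    snrs = []
    for f_hz in modulation_frequencies():
        m = modulation_index(impulses, f_hz, noise_ratio, c)
        if m >= 1.0:
            snr_db = 15.0       # perfect modulation: clip at the upper limit
        elif m <= 0.0:
            snr_db = -15.0      # no modulation left: clip at the lower limit
        else:
            snr_db = 10.0 * math.log10(m / (1.0 - m))
            snr_db = max(-15.0, min(15.0, snr_db))
        snrs.append(snr_db)
    return (sum(snrs) / len(snrs) + 15.0) / 30.0


if __name__ == "__main__":
    # Direct sound plus one strong reflection arriving 34.3 m (0.1 s) later.
    room = [(1.0, 0.0), (0.5, 34.3)]
    print(round(speech_transmission_index(room, noise_ratio=0.1), 3))
```

A single direct impulse with no noise gives an STI of 1.0, and adding noise of equal intensity to the signal halves every modulation index and gives 0.5, which makes a convenient sanity check.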

Useful/Detrimental Ratios

These measures are essentially an early to late sound ratio, similar to those discussed earlier, with
the effects of background sound energy added to the late arriving sound. Lochner and Burger first
introduced the concept of such a ratio as a predictor of speech intelligibility scores based on useful
energy calculated from a weighted sum of the sound energy arriving in the first 95 msec. The
detrimental energy was therefore that sound energy arriving later, with the ambient background
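noise added in. In order to derive the useful/detrimental energy, an early to late ratio must first be
established. The Lochner and Burger form, referred to as C95, is given by the following equation:

$$ C_{95} = 10 \log_{10} \frac{\int_{0}^{0.095} m \, p^{2}(t) \, dt}{\int_{0.095}^{\infty} p^{2}(t) \, dt} $$

where p(t) is the sound pressure of the impulse response and m is the fraction of energy of each
individual reflection integrated into the useful energy sum.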
The calculation of m is quite tricky as it is based on subjective threshold observations. The basic
assumption is that for any specific impulse earlier than 95 msec, a portion of its energy will be
integrated with the direct sound. This portion varies as a function of both relative delay and relative
level. An approximation of the value of m is given by Bradley as:

where a represents the relative amplitude of the reflection and t its relative delay. From this
early/late ratio, the useful/detrimental ratio (U95) may be determined in the following manner:

$$ U_{95} = 10 \log_{10} \frac{c_{95}}{1 + (c_{95} + 1) \, E_{BL} / E_{SL}}, \qquad c_{95} = 10^{C_{95}/10} $$

where EBL and ESL are the total background and speech energies given by EBL = 10^(BL/10) and
ESL = 10^(SL/10) respectively (BL and SL being the long-term, steady-state rms background and
speech levels). Using the above equation, Bradley suggests that other early to late ratios can be
applied similarly, for example U80 derived from C80. The benefit of these values is that they are
simpler to calculate as they do not require the complex weighting procedure as described by
Lochner and Burger. From a plot of measured speech intelligibility against measured U95 and U80
values, Bradley derived the following relationships from the best-fit third order polynomial for each
data group;

$$ SI = 0.7348 \, U_{95} - 0.09943 \, U_{95}^{2} + 0.0005457 \, U_{95}^{3} + 197.39 $$

and:

$$ SI = 1.219 \, U_{80} - 0.02466 \, U_{80}^{2} + 0.00295 \, U_{80}^{3} + 95.65 $$

From an investigation of the relative merits of these and other measures, Bradley suggests
that U80 seemed a safer and generally more reliable predictor of intelligibility.
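
To make the calculation concrete, the following Python sketch derives a useful/detrimental ratio from an early/late ratio and the two long-term levels, then applies the U80 polynomial. Two things are assumed here: the useful/detrimental form follows from treating early energy as useful and late energy plus background noise as detrimental, and the polynomial terms are taken as first, second and third order in U80, in that sequence. The function names are hypothetical, and the fit should only be trusted within the range of the data it was derived from.

```python
import math


def useful_detrimental_ratio(c_db: float,
                             speech_level_db: float,
                             background_level_db: float) -> float:
    """U (dB) from an early/late ratio C (dB) and the long-term levels."""
    c_lin = 10.0 ** (c_db / 10.0)                 # early/late energy ratio
    e_sl = 10.0 ** (speech_level_db / 10.0)       # total speech energy
    e_bl = 10.0 ** (background_level_db / 10.0)   # total background energy
    return 10.0 * math.log10(c_lin / (1.0 + (c_lin + 1.0) * e_bl / e_sl))


def speech_intelligibility_from_u80(u80_db: float) -> float:
    """Third order best-fit polynomial for SI (%) against U80 (dB)."""
    return (1.219 * u80_db
            - 0.02466 * u80_db ** 2
            + 0.00295 * u80_db ** 3
            + 95.65)


if __name__ == "__main__":
    # C80 of 3 dB, 60 dB speech and 50 dB background noise.
    u80 = useful_detrimental_ratio(3.0, 60.0, 50.0)
    print(round(u80, 2), round(speech_intelligibility_from_u80(u80), 1))
```

With negligible background noise the ratio reduces to the early/late ratio itself, and with a very large early/late ratio it approaches the plain speech-to-noise level difference.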
