
www.jntuworld.com
Audio Coding
Psychoacoustics
S. R. M. Prasanna

Dept of ECE,
IIT Guwahati,
prasanna@iitg.ernet.in

Audio Coding – p. 1/4


Motivation
Acoustics: Study of sounds
Psychoacoustics: Study of perception of sounds
Deals with characterizing human auditory perception
Particularly time-frequency analysis capabilities of inner
ear
Audio coders achieve significant compression by
exploiting the property that perceptually irrelevant
information cannot be heard
Perceptually irrelevant information is identified by
incorporating several psychoacoustic principles

Human Speech Perception

Figure 1: Cross Section of Human Ear


Functions of Human Ear
Mainly three regions - outer ear, middle ear & inner ear
Outer ear - directs speech pressure variations towards
the middle ear
Middle ear - transforms pressure variations into
mechanical motion
Inner ear - converts mechanical vibrations into electrical
firings in the auditory neurons, which lead to the brain
Language decoding and message understanding take place at
the higher centers of learning in the brain; this stage is
less well understood

Inner Ear

Figure 2: Figures Related to Inner Ear


Frequency to Place Transformation
Sound waves to mechanical vibrations by middle ear
Mechanical vibrations to traveling waves by inner ear
along the length of basilar membrane
Neural receptors are connected along the length of the
basilar membrane
Traveling waves generate peak responses at frequency
specific membrane positions
Therefore different neural receptors are effectively
tuned to different frequency bands according to their
locations.

Frequency to Place Transformation (contd.)
For sinusoidal stimuli, the peak response occurs near
the basilar membrane region with a resonant freq.
equal to input sinusoid freq.
Location of peak is characteristic place for the stimulus
Freq. that best excites a particular place is
characteristic frequency
Thus a frequency to place transformation takes place

Signal Processing Perspective
Bank of highly overlapping band pass filters
Magnitude responses are asymmetric
Bandwidths increase with frequency

Sound Pressure Level (SPL)
A std. metric that quantifies the intensity of an
acoustical stimulus
SPL gives the level (intensity) of sound pressure in dBs
relative to an internationally defined ref. level
$L_{SPL} = 20 \log_{10}(p/p_0)$ (dB)
where $L_{SPL}$ is the SPL of a stimulus, $p$ is the sound
pressure of the stimulus in pascals, and $p_0$ is the std. ref. level of
20 µPa
About 150 dB SPL spans the dynamic range of intensity
for human auditory system
Min value is the limit of detection for low intensity (quiet)
stimuli
Max value is the threshold of pain for high intensity
(loud) stimuli
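The SPL definition above can be sketched in a few lines of Python. The names `P0` and `spl_db` are illustrative choices, not part of the slides.

```python
import math

P0 = 20e-6  # standard reference pressure: 20 micropascals


def spl_db(p: float) -> float:
    """Sound pressure level in dB SPL for a sound pressure p in pascals."""
    if p <= 0:
        raise ValueError("sound pressure must be positive")
    return 20.0 * math.log10(p / P0)


print(spl_db(20e-6))  # the reference pressure itself maps to 0 dB SPL
print(spl_db(1.0))    # a pressure of 1 Pa is roughly 94 dB SPL
```

Each tenfold increase in pressure adds 20 dB, so the roughly 150 dB dynamic range mentioned above corresponds to a pressure ratio of more than seven orders of magnitude.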
Absolute Threshold for Hearing (ATH)
Amount of energy needed in a pure tone such that it can
be detected by a listener in a noiseless environment
ATH is expressed in dB SPL
ATH is frequency dependent parameter and is given by
$T_q(f) = 3.64\,(f/1000)^{-0.8} - 6.5\,e^{-0.6\,(f/1000 - 3.3)^2} + 10^{-3}\,(f/1000)^4$
dB (SPL)
In the context of signal compression, Tq (f ) could be
interpreted naively as a maximum allowable energy
level for coding distortions introduced in the frequency
domain (Fig 5.1 from Spanias book)
Use of ATH to shape the coding distortion spectrum
represents the first step towards perceptual coding.
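As a rough illustration, the ATH expression can be evaluated directly. This is a minimal sketch assuming `f` is given in Hz; the function name `threshold_in_quiet` is a hypothetical choice.

```python
import math


def threshold_in_quiet(f_hz: float) -> float:
    """Absolute threshold of hearing Tq(f) in dB SPL, with f in Hz."""
    k = f_hz / 1000.0  # frequency in kHz
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)


for f in (100, 1000, 3300, 10000):
    print(f, round(threshold_in_quiet(f), 2))
```

The curve is high at low frequencies, dips below 0 dB SPL near 3.3 kHz where hearing is most sensitive, and rises again towards high frequencies.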
ATH Diagram

Figure 3: Absolute Threshold for Hearing


Critical Bands (CB)
Critical band is a function of frequency that quantifies
the cochlear filter passbands
CB tends to remain constant (about 100 Hz) up to 500
Hz and increases to approximately 20% of the center
frequency above 500 Hz
For an average listener the critical bandwidth is given
by $BW_c(f) = 25 + 75\,[1 + 1.4\,(f/1000)^2]^{0.69}$ (Hz)
The function
$Z_b(f) = 13 \arctan(0.00076 f) + 3.5 \arctan[(f/7500)^2]$ (Bark)
is often used to convert frequency in Hz to Bark scale
Nonuniform Hz spacing of the filter bank is actually
uniform on a Bark scale
One critical band (CB) comprises one Bark. (Table 5.1
and Fig. 5.4)
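The two expressions on this slide can be sketched as follows. The function names are illustrative, and the bandwidth formula is written with the frequency normalized to kHz (f/1000), which is the usual form of this expression and is assumed here.

```python
import math


def critical_bandwidth_hz(f_hz: float) -> float:
    """Critical bandwidth BWc(f) in Hz for a center frequency f in Hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69


def hz_to_bark(f_hz: float) -> float:
    """Map a frequency in Hz to the Bark scale, Zb(f)."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))


for f in (100, 500, 1000, 4000, 10000):
    print(f, round(critical_bandwidth_hz(f), 1), round(hz_to_bark(f), 2))
```

At 100 Hz the bandwidth comes out close to 100 Hz, consistent with the roughly constant bandwidth at low frequencies, and 1 kHz maps to about 8.5 Bark.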
Critical Bands

Figure 4: Table Showing Critical Bands


Mapping from Hz to Bark

Figure 5: Mapping from Hz to Bark Scale


Simultaneous Masking
Masking: One sound is rendered inaudible because of
the presence of another sound
Simultaneous masking: When two or more stimuli are
simultaneously presented to the auditory system
Freq. Domain: Relative shapes of the masker and
maskee magnitude spectra determine to what extent
presence of certain spectral energy will mask the
presence of other spectral energy
Time Domain: Phase relationships between stimuli can
also affect masking outcomes
In simple terms, the presence of a strong noise or tone
masker creates an excitation of sufficient strength on
the basilar membrane at the critical band location to
effectively block detection of a weaker (maskee) signal.
Types of Simultaneous Masking
Noise-Masking-Tone (NMT), Tone-Masking-Noise
(TMN) and Noise-Masking-Noise (NMN)
NMT:
A NB noise (1 Bark) masks a tone within the same
CB, provided intensity of masked tone is below a
predictable threshold
Signal-to-Mask Ratio (SMR) (dB) is the difference
between the levels of the masker and the maskee
Min. SMR at the threshold of detection occurs when
the maskee freq. is close to the center freq. of the masker
and is about 5 dB

TMN and NMN
TMN:
Pure tone at the center of a CB masks noise of any
subcritical BW, provided noise spectrum is below a
predictable threshold
Min SMR lies between 21 and 28 dB
NMN:
A NB noise masks another NB noise
Min SMR is about 26 dB

Masking Schemes

Figure 6: Masking schemes


Asymmetry of Masking
The NMT and TMN show asymmetry in masking power
between noise masker and tone masker
Even with both maskers at the same dB SPL, the associated
threshold SMRs differ by about 20 dB
Hence the interest in all types of masking
Knowledge of all three is critical to succeed in the task
of shaping coding distortion
For each temporal analysis interval, a codec’s
perceptual model should identify across the freq
spectrum noise-like and tone-like components within
both the audio signal and the coding distortion
Model should then apply appropriate masking
relationships to obtain global masking threshold

Spread of Masking
Simultaneous masking is not bandlimited to within the
boundaries of a single CB
Interband masking also occurs, i.e., a masker centered
within one critical band has some predictable effect on
detection thresholds in other CBs.
This effect is known as spread of masking
Spread of masking is often approximated by a triangular
spreading function with slopes of +25 and -10 dB per Bark:
$SF_{dB}(x) = 15.81 + 7.5\,(x + 0.474) - 17.5\sqrt{1 + (x + 0.474)^2}$
where $x$ is in Barks and $SF_{dB}(x)$ is expressed in dB.
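A short sketch can confirm the quoted slopes numerically. The function name `spreading_db` is an illustrative choice; the constants are those of the expression above.

```python
import math


def spreading_db(x: float) -> float:
    """Spreading function SF_dB(x) in dB, with x the Bark distance."""
    y = x + 0.474
    return 15.81 + 7.5 * y - 17.5 * math.sqrt(1.0 + y * y)


# Far from the peak the function approaches straight lines:
lower_slope = spreading_db(-10.0) - spreading_db(-11.0)  # close to +25 dB/Bark
upper_slope = spreading_db(11.0) - spreading_db(10.0)    # close to -10 dB/Bark
print(round(spreading_db(0.0), 3), round(lower_slope, 2), round(upper_slope, 2))
```

The value at x = 0 is very close to 0 dB, so a masker passes through to its own band almost unattenuated, and the shallower upper slope means masking extends further towards higher Bark values than lower ones.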

Just Noticeable Distortion (JND)
Global masking threshold comprises an estimate of the
level at which quantization noise becomes just
noticeable
Hence global masking threshold is sometimes referred
to as JND

Nonsimultaneous Masking
Also termed temporal masking
Masking phenomenon extends beyond window of
simultaneous stimulus presentation
Masking occurs both prior to masker onset and also
after masker removal
Forward (post) and backward (pre) masking are the two types


Figure 7: Temporal Masking

Perceptual Entropy
Entropy gives min. no. of bits/sample required to store
or transmit given message block
Johnston combined the notion of psychoacoustic masking
with signal quantization principles to define Perceptual
Entropy (PE).
Perceptual Entropy gives min. no. of bits/sample
required to store or transmit perceptually relevant
information in given audio message block.
While discussing PE, conventional entropy is termed as
statistical entropy.
Statistical entropy employs the statistical properties of
the signal for computing entropy
Perceptual entropy employs both statistical and
perceptual properties of signal for computing entropy.
Basis for PE
Masking threshold indicates the amount of quantization noise that
can be introduced in the freq. domain without perceptually
corrupting the signal.
Assume that step size and no. of levels in the quantizer
for each spectral line could be set independently.
Further choice of step size is such that total noise
injected at each frequency corresponds to masking
threshold i.e., min no of quantization levels are used.
Then no. of bits required to encode entire transform
represents min. no. of bits necessary to transmit that
block of the signal.
The total number of bits divided by the no. of samples in
the transform represents per-sample rate.
This per-sample bit rate is Perceptual Entropy of signal.

PE v/s SE
Statistical entropy (SE) exploits signal statistics
Perceptual entropy (PE) exploits signal statistics and
also psychoacoustic masking
PE uses just enough quantization levels to avoid perceptual
distortion due to quantization, by exploiting masking
thresholds.

Steps for PE Computation
DFT computation
Finding Masking thresholds
Calculating no. of bits to quantize DFT spectrum

DFT Computation
Windowing and frequency transformation
2048 sample DFT by FFT
1024 are considered for further analysis

Calculation of Masking Threshold
Critical band analysis
Applying spreading function to critical band spectrum
Calculating Masking Thresholds
Accounting for absolute thresholds
Relating spread masking threshold to critical band
masking threshold

Critical Band Analysis
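DFT spectrum is complex: $S(\omega) = Re(\omega) + j\,Im(\omega)$
Power spectrum: $P(\omega) = Re^2(\omega) + Im^2(\omega)$
$P(\omega)$ is partitioned into CBs
Energy in each CB: $B_i = \sum_{\omega = bl_i}^{bh_i} P(\omega)$,
where $bl_i$ and $bh_i$ are the lower and upper boundaries of CB $i$
$B_i$ represents the CB spectrum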
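The band-energy step can be sketched as below. The representation of band boundaries as a list of inclusive `(lo, hi)` bin-index pairs and the toy values are assumptions made for illustration; real boundaries would come from a critical band table.

```python
def band_energies(spectrum, band_edges):
    """Sum the power spectrum over each critical band.

    spectrum   : list of complex DFT values
    band_edges : list of (lo, hi) inclusive bin indices, one pair per band
    """
    power = [s.real ** 2 + s.imag ** 2 for s in spectrum]
    return [sum(power[lo:hi + 1]) for lo, hi in band_edges]


spectrum = [1 + 1j, 2 + 0j, 0 + 3j, 1 - 1j]  # toy 4-bin spectrum
edges = [(0, 1), (2, 3)]                     # hypothetical band edges
print(band_energies(spectrum, edges))        # [6.0, 11.0]
```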

Spreading Function (SF)
CB spectrum threshold is also influenced by adjacent
CBs which is accounted using SF.
SF is used to estimate effects of masking across CBs
SF is calculated for $|j - i| \le 25$, where $i$ is the Bark freq.
of the masked signal and $j$ is the Bark freq. of the masking
signal, and placed into a matrix $S_{ij}$
Spread CB spectrum: $C_i = S_{ij} * B_i$, i.e., $C_i = \sum_j S_{ij} B_j$
Effect of spreading function is to spread peaks in Bi and
also raise threshold values, especially at higher
frequencies.
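A minimal sketch of the spreading step follows. It assumes the analytic spreading function from the Spread of Masking slide, evaluated at integer Bark distances i - j and converted from dB to a power gain; the function names and the toy input are illustrative.

```python
import math


def spreading_db(x: float) -> float:
    """Spreading function in dB for a Bark distance x (maskee minus masker)."""
    y = x + 0.474
    return 15.81 + 7.5 * y - 17.5 * math.sqrt(1.0 + y * y)


def spread_cb_spectrum(band_energy):
    """Return C_i = sum_j S_ij * B_j for |j - i| <= 25."""
    n = len(band_energy)
    spread = []
    for i in range(n):              # i: band being masked
        total = 0.0
        for j in range(n):          # j: masking band
            if abs(j - i) <= 25:
                gain = 10.0 ** (spreading_db(i - j) / 10.0)  # dB to power
                total += gain * band_energy[j]
        spread.append(total)
    return spread


b = [0.0] * 10
b[5] = 1.0                           # a single energetic band
print([round(c, 4) for c in spread_cb_spectrum(b)])
```

With a single energetic band, the output shows the peak smeared into neighboring bands, more strongly on the high-frequency side than on the low-frequency side.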

Masking Thresholds
TMN is estimated as $14.5 + i$ dB below $C_i$, where $i$ is the Bark
freq.
NMT is estimated as 5.5 dB below $C_i$ uniformly across the
CB spectrum

Tone Like and Noise Like Components
Spectral Flatness Measure: $SFM = GM/AM$
$GM$ is the geometric mean of $P(\omega)$ and $AM$ is the arithmetic mean
of $P(\omega)$
$SFM_{dB} = 10 \log_{10}(GM/AM)$
Coeff. of tonality: $\alpha = \min(SFM_{dB}/SFM_{dB\,max},\ 1)$
$SFM_{dB\,max} = -60$ dB is used to estimate tonality
$SFM_{dB} = 0$ dB indicates a completely noise-like signal ($\alpha = 0$)
$SFM_{dB} = -30$ dB indicates $\alpha = 0.5$
$SFM_{dB} = -75$ dB indicates $\alpha = 1.0$
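The tonality estimate can be sketched as below. The function names are illustrative; the lower clamp at 0 is an added numerical guard against round-off (an assumption, since mathematically GM never exceeds AM), and the input power values are assumed strictly positive.

```python
import math

SFM_DB_MAX = -60.0  # SFM value treated as fully tonal


def sfm_db(power):
    """Spectral flatness in dB: 10*log10(geometric mean / arithmetic mean)."""
    n = len(power)
    am = sum(power) / n
    gm = math.exp(sum(math.log(p) for p in power) / n)  # needs p > 0
    return 10.0 * math.log10(gm / am)


def tonality_coefficient(sfm_value_db):
    """alpha = min(SFM_dB / SFM_dBmax, 1), clamped at 0 against round-off."""
    return max(0.0, min(sfm_value_db / SFM_DB_MAX, 1.0))


flat = [1.0] * 16                 # noise-like: flat power spectrum
peaky = [1e6] + [1e-3] * 15       # tone-like: one dominant line
print(tonality_coefficient(sfm_db(flat)), tonality_coefficient(sfm_db(peaky)))
```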

Offset for Masking Energy
$O_i = \alpha\,(14.5 + i) + (1 - \alpha)\,5.5$ (dB), in each band $i$
Index $\alpha$ is used to geometrically weight the two
thresholds
$O_i$ is then subtracted from $C_i$ to yield the spread threshold
estimate $T_i = 10^{\log_{10}(C_i) - O_i/10}$
Since the spreading function does not have normalized
gain, the threshold is normalized by the DC gain for each CB
After normalization, bark thresholds are compared to
absolute thresholds.
Any CB that has bark threshold lower than absolute
threshold is changed to the absolute threshold
This will be the threshold used for computing bit rate.
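The offset and threshold steps can be sketched as follows. Function names are illustrative, the DC-gain renormalization is omitted for brevity, and the absolute thresholds are assumed to be supplied in the same power-domain units as the band thresholds.

```python
import math


def masking_offset_db(alpha, i):
    """O_i = alpha*(14.5 + i) + (1 - alpha)*5.5, in dB, for Bark band i."""
    return alpha * (14.5 + i) + (1.0 - alpha) * 5.5


def spread_threshold(c_i, alpha, i):
    """T_i = 10^(log10(C_i) - O_i/10), i.e. C_i lowered by O_i dB."""
    return 10.0 ** (math.log10(c_i) - masking_offset_db(alpha, i) / 10.0)


def apply_absolute_threshold(thresholds, absolute):
    """Raise any band threshold that falls below the absolute threshold."""
    return [max(t, a) for t, a in zip(thresholds, absolute)]


print(masking_offset_db(0.0, 10))   # pure noise masker: 5.5 dB
print(masking_offset_db(1.0, 10))   # pure tone masker: 24.5 dB
print(round(spread_threshold(100.0, 0.0, 3), 2))
```

A tonal signal (alpha near 1) therefore gets a much lower threshold than a noise-like one with the same spread energy, which reflects the asymmetry of masking.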

Calculation of Bit Rate
No. of quantization levels to follow signal in freq domain
$T_i$ is in the power domain
Quantization energy must be spread across $k_i$ spectral
lines in each CB
Assuming noise to spread equally across the entire
band, noise energy will be $\delta^2/12$
Energy at each spectral freq. $= T_i/k_i$
Real and imaginary parts are quantized independently, so the
energy per component $= T_i/(2k_i)$
$\delta^2/12 = T_i/(2k_i) \implies \delta = T_i' = \sqrt{6\,T_i/k_i}$
$T_i'$ is the step size.

Computing PE
$N_{Re}(\omega) = |\mathrm{nint}(Re(\omega)/T_i')|$ and
$N_{Im}(\omega) = |\mathrm{nint}(Im(\omega)/T_i')|$ for each $\omega$ within CB $i$.
Let $N_*$ represent the actual (integer) quantized value of
each line

If $N_{(Re\ or\ Im)}(\omega) = 0$, then $N'_{(Re\ or\ Im)}(\omega) = 0$

If $N_{(Re\ or\ Im)}(\omega) \neq 0$, then $N'_{(Re\ or\ Im)}(\omega) = \log_2(2N_*(\omega) + 1)$
This operation assigns a bit rate of zero bits to any
signal with an amplitude that does not need to be
quantized and assigns a bit rate of $\log_2$(no. of levels) to
those that must be quantized.
Total bit rate $= \sum_{\omega=0}^{\pi} \left(N'_{Re}(\omega) + N'_{Im}(\omega)\right)$
Rate per sample, $PE$ = Total bit rate / 2048
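Putting the step size and bit counting together, a minimal sketch of the PE computation might look like this. All function names, the `(lo, hi)` band representation, and the toy inputs are assumptions for illustration; `nint` is implemented as rounding halves away from zero.

```python
import math


def step_size(t_i, k_i):
    """Quantizer step T_i' = sqrt(6*T_i/k_i) for a band with k_i lines."""
    return math.sqrt(6.0 * t_i / k_i)


def quantized_level(value, step):
    """abs(nint(value/step)), with halves rounded away from zero."""
    return int(math.floor(abs(value) / step + 0.5))


def bits_for_component(value, step):
    """Zero bits if the level is 0, else log2(2N + 1) bits."""
    n = quantized_level(value, step)
    return math.log2(2 * n + 1) if n != 0 else 0.0


def perceptual_entropy(spectrum, bands, thresholds, n_samples=2048):
    """Bits per sample given complex spectrum, band edges and thresholds T_i."""
    total_bits = 0.0
    for (lo, hi), t_i in zip(bands, thresholds):
        step = step_size(t_i, hi - lo + 1)
        for s in spectrum[lo:hi + 1]:
            total_bits += bits_for_component(s.real, step)
            total_bits += bits_for_component(s.imag, step)
    return total_bits / n_samples


# Toy example: two bands of two lines each, thresholds in the power domain.
spec = [10 + 2j, 0.1 + 0.1j, 4 - 4j, 0 + 1j]
print(perceptual_entropy(spec, [(0, 1), (2, 3)], [1.0, 4.0], n_samples=4))
```

Components smaller than half a step cost no bits at all, which is how a higher masking threshold translates directly into a lower bit rate.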
Example codec perceptual model
ISO/IEC 11172-3 (MPEG-1) Psychoacoustic Model-1
Determines max. allowable quantization noise energy
in each CB such that it remains inaudible.
Blocking i/p audio into frames
High resolution spectral computation for each frame
For each frame tonal and noise maskers estimation
Decimation and reorganization of maskers
Calculation of individual masking thresholds for
components in each CB
Calculation of global masking thresholds for each CB

Spectral Analysis
512 point DFT computation
Power Spectral Density (PSD) P (k) estimation, where
k = 1, 2, . . . , 512

[Figure: power spectral density, SPL (dB) versus Frequency (Hz)]

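Identification of Tonal and Noise Maskers
$P(k)$ where $k = 1, 2, \ldots, 256$ are considered
Local maxima in the PSD that exceed neighboring components within
a certain Bark distance by at least 7 dB are classified as tonal
Tonal set $S_T$ is defined as

$S_T = \{\, P(k) \mid P(k) > P(k \pm 1),\ P(k) > P(k \pm \Delta_k) + 7\ \text{dB} \,\}$

where

$\Delta_k \in \begin{cases} 2, & 2 < k < 63 \quad (0.17\text{-}5.5\ \text{kHz}) \\ [2, 3], & 63 \le k < 127 \quad (5.5\text{-}11\ \text{kHz}) \\ [2, 6], & 127 \le k \le 256 \quad (11\text{-}20\ \text{kHz}) \end{cases}$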
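The tonal-peak test can be sketched as below. This assumes that a range such as [2, 6] means every integer offset from 2 to 6 must satisfy the 7 dB condition, and that the PSD list is indexed directly by k with entry 0 unused; both are assumptions, as are the function names.

```python
def delta_offsets(k):
    """Neighbor offsets tested for bin k in the three frequency ranges."""
    if 2 < k < 63:
        return [2]
    if 63 <= k < 127:
        return [2, 3]
    if 127 <= k <= 256:
        return [2, 3, 4, 5, 6]
    return []


def find_tonal_bins(psd_db):
    """Return the bins k whose PSD value (in dB) qualifies as a tonal peak."""
    tonal = []
    for k in range(3, min(len(psd_db), 257)):
        offsets = delta_offsets(k)
        if k + max(offsets) >= len(psd_db):
            break  # not enough neighbors left on the high side
        p = psd_db[k]
        if p <= psd_db[k - 1] or p <= psd_db[k + 1]:
            continue  # not a strict local maximum
        if all(p > psd_db[k - d] + 7.0 and p > psd_db[k + d] + 7.0
               for d in offsets):
            tonal.append(k)
    return tonal


psd = [0.0] * 300
psd[50], psd[100], psd[200] = 30.0, 5.0, 20.0
print(find_tonal_bins(psd))  # bin 100 is a local peak but fails the 7 dB test
```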

Tonal and Noise Maskers (contd.)
Tonal maskers $P_{TM}(k)$ are computed from spectral
peaks listed in $S_T$:

$P_{TM}(k) = 10 \log_{10} \sum_{j=-1}^{1} 10^{0.1 P(k+j)}$ (dB)

For each neighborhood max, energy from three
adjacent spectral lines is combined to form a single tonal masker
For each CB, a single NM $P_{NM}(\bar{k})$ is then computed
from (remaining) spectral lines not within the $\pm\Delta_k$
neighborhood of a tonal masker, using the sum

$P_{NM}(\bar{k}) = 10 \log_{10} \sum_{j} 10^{0.1 P(j)}$ (dB), $\forall P(j) \notin \{P_{TM}(k, k \pm 1, k \pm \Delta_k)\}$

where $\bar{k}$ is the geometric mean spectral line of the CB
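Both masker levels are power sums carried out in the dB domain, which a short sketch makes concrete. The function names and the explicit list of line indices passed to the noise masker are illustrative assumptions.

```python
import math


def db_power_sum(levels_db):
    """Combine levels given in dB by summing their powers."""
    return 10.0 * math.log10(sum(10.0 ** (0.1 * p) for p in levels_db))


def tonal_masker_db(psd_db, k):
    """P_TM(k): combine the peak at bin k with its two adjacent lines."""
    return db_power_sum([psd_db[k + j] for j in (-1, 0, 1)])


def noise_masker_db(psd_db, lines):
    """P_NM: combine the remaining (non-tonal) lines of one critical band."""
    return db_power_sum([psd_db[j] for j in lines])


psd = [40.0, 60.0, 60.0, 60.0, 40.0]
print(round(tonal_masker_db(psd, 2), 2))          # three equal 60 dB lines
print(round(noise_masker_db(psd, [0, 4]), 2))     # two 40 dB lines
```

Three equal lines add about 4.8 dB to the level of a single line, and two equal lines add about 3 dB.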


Decimation of Maskers
No. of maskers are reduced using two criteria
First, any tonal or noise maskers below abs. threshold
are discarded, i.e., only maskers with $P_{TM,NM}(k) \ge T_q(k)$ are retained.
Next, a sliding 0.5 Bark-wide window is used to replace
any pair of maskers occurring within a distance of 0.5
Bark by the stronger of the two.
Masker freq. bins are reorganized using the decimation
scheme

$P_{TM,NM}(i) = P_{TM,NM}(k)$
$P_{TM,NM}(k) = 0$

Decimation (contd.)

$i = \begin{cases} k, & 1 \le k \le 48 \\ k + (k \bmod 2), & 49 \le k \le 96 \\ k + 3 - ((k - 1) \bmod 4), & 97 \le k \le 232 \end{cases}$

Net effect is 2:1 decimation of masker bins in CBs 18-22 and
4:1 decimation of masker bins in CBs 22-25,
with no loss of masking components.
Decimation reduces total no. of tone and noise masker
freq. bins under consideration from 256 to 106
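The index mapping can be checked with a short sketch that counts the distinct destination bins. The function name `decimated_bin` is an illustrative choice.

```python
def decimated_bin(k):
    """Destination bin i for a masker originally located at bin k."""
    if 1 <= k <= 48:
        return k                        # no decimation
    if 49 <= k <= 96:
        return k + (k % 2)              # 2:1 decimation (odd bins move up)
    if 97 <= k <= 232:
        return k + 3 - ((k - 1) % 4)    # 4:1 decimation
    raise ValueError("k must lie in 1..232")


bins = sorted({decimated_bin(k) for k in range(1, 233)})
print(len(bins))  # 106
```

The count splits as 48 untouched bins, 24 bins from the 2:1 region and 34 bins from the 4:1 region, which matches the total of 106 quoted above.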

Individual Masking Thresholds
Using decimated set of tonal and noise maskers,
individual tone and noise masking thresholds are
computed
Each individual threshold represents a masking
contribution at freq. bin i due to the tone or noise
masker located at bin j
Tonal Masking Threshold, $T_{TM}(i, j)$ is given by
$T_{TM}(i, j) = P_{TM}(j) - 0.275\,z_b(j) + SF(i, j) - 6.025$ (dB SPL)
where $P_{TM}(j)$ is the SPL of the tonal masker in freq. bin $j$,
$z_b(j)$ is the Bark freq. of bin $j$ and $SF(i, j)$ is the spreading of
masking from bin $j$ to bin $i$
Noise Masking Threshold, $T_{NM}(i, j)$ is given by
$T_{NM}(i, j) = P_{NM}(j) - 0.175\,z_b(j) + SF(i, j) - 2.025$ (dB SPL)
where $P_{NM}(j)$ is the SPL of the noise masker in freq. bin $j$
Global Masking Thresholds
Individual masking thresholds are combined to estimate
a global masking threshold for each freq. bin
$T_g(i) = 10 \log_{10}\left(10^{0.1 T_q(i)} + \sum_{l=1}^{L} 10^{0.1 T_{TM}(i,l)} + \sum_{m=1}^{M} 10^{0.1 T_{NM}(i,m)}\right)$ (dB SPL)
where $L$ and $M$ are the number of tonal and noise maskers, respectively.
Bits are allocated based on the global masking thresholds;
this is termed perceptual bit allocation.
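The combination rule is a power-domain sum of the absolute threshold and all individual masking thresholds at one bin. A minimal sketch, with an illustrative function name and toy inputs:

```python
import math


def global_threshold_db(tq_db, tonal_db, noise_db):
    """Global masking threshold T_g(i) in dB SPL at one frequency bin.

    tq_db    : absolute threshold T_q(i) at this bin, in dB SPL
    tonal_db : individual tonal thresholds T_TM(i, l) at this bin, in dB SPL
    noise_db : individual noise thresholds T_NM(i, m) at this bin, in dB SPL
    """
    power = 10.0 ** (0.1 * tq_db)
    power += sum(10.0 ** (0.1 * t) for t in tonal_db)
    power += sum(10.0 ** (0.1 * t) for t in noise_db)
    return 10.0 * math.log10(power)


print(global_threshold_db(0.0, [], []))                  # no maskers: T_q itself
print(round(global_threshold_db(10.0, [10.0], [10.0]), 2))
```

Because the sum is taken over powers, the global threshold is never below the absolute threshold and is dominated by the strongest contribution at each bin.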

Expt. 5-AC- Audio Synthesis using MSE
Problem No. 2.25 (pp. 49) of Spanias book on Audio
Signal Processing

Expt. 6-AC- Audio Synthesis using Psychoacoustics

Problem No. 5.11 (pp. 142) of Spanias book on Audio
Signal Processing
