
Digital Audio Coding
Dr. T. Collins
Standard MIDI Files
Perceptual Audio Coding
MPEG-1 layers 1, 2 & 3
MPEG-4
Ancient Audio Coding Methods
Audio coding has actually been around for hundreds
of years
Traditionally, composers record their music by writing
out the notes in a standard notation
A slightly more modern equivalent example is the
Victorian piano roll
[Figure: 200-year-old example of audio coding]
Standard MIDI Files
A piano roll can be efficiently digitally encoded by
recording the time when each note begins and ends
This is what a standard MIDI file does
The MIDI standard (Musical Instrument Digital
Interface) is an internationally agreed language
Standard MIDI files encode
MIDI events/messages e.g. note-on, note-off, etc.
The time delay between each event
As well as encoding when each note begins and ends, it also allows:
Up to 16 different instruments to be played at once
Transmission of parameters containing key velocity, volume,
modulation etc.
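As a rough illustration of the two things a standard MIDI file stores (events and the delay between them), the sketch below turns piano-roll style notes into a delta-timed event list. The `Note` tuple, the tick unit and the function name `to_delta_events` are assumptions made for this example; it does not write the real binary file format.

```python
from typing import NamedTuple

class Note(NamedTuple):
    start: int          # time the note begins, in ticks (hypothetical unit)
    end: int            # time the note ends, in ticks
    pitch: int          # MIDI note number, e.g. 60 = middle C
    velocity: int = 64  # key velocity, 0-127

def to_delta_events(notes):
    """Return (delta_ticks, message, pitch, velocity) tuples in time order."""
    absolute = []
    for n in notes:
        absolute.append((n.start, "note-on", n.pitch, n.velocity))
        absolute.append((n.end, "note-off", n.pitch, 0))
    # Sort by time; at equal times put note-off before note-on so a repeated
    # note is released before it is struck again.
    absolute.sort(key=lambda e: (e[0], e[1] != "note-off"))
    events, previous = [], 0
    for time, message, pitch, velocity in absolute:
        events.append((time - previous, message, pitch, velocity))
        previous = time
    return events

roll = [Note(0, 480, 60), Note(480, 960, 64), Note(480, 960, 67)]
for event in to_delta_events(roll):
    print(event)
```

Only the delay since the previous event is stored, so simultaneous events simply carry a delay of zero.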
Standard MIDI Limitations
In a MIDI file, it is the instructions to play the
notes that are stored, not the audio itself
The quality of the reproduction depends on
the synthesiser used for playback
[Audio examples: original recording; playback on other synthesisers / sound cards]
MIDI vs. Digital Audio
MIDI                                         | Digital Audio
Stores instructions to turn notes on and off | Stores the actual sampled audio
Very efficient (typical rate: 1 kbps)        | Less efficient (typical rate: 100 kbps)
Playback quality depends on the MIDI device  | Playback quality is always the same
Only synthesised instruments can be used     | Any sounds (including speech and singing) can be recorded
Sampling
Digital audio represents the continuous analogue
audio waveform by a series of discrete samples
The sample rate must be at least double the
bandwidth of the audio signal
Typical hi-fi sample rates are 44.1 kHz (CD audio) and
48 kHz (DAT and DAB radio)
[Figure: Sound Pressure Level vs. Frequency, with Fs/2 and the sample rate Fs marked on the frequency axis]
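The "at least double the bandwidth" rule can be sketched in a few lines. The function names and the 20 kHz audio bandwidth are assumptions chosen for illustration; `alias_frequency` shows where a tone would appear if the rule were broken.

```python
def min_sample_rate(bandwidth_hz):
    """Lowest sample rate that satisfies the 'at least double' rule."""
    return 2 * bandwidth_hz

def alias_frequency(tone_hz, sample_rate_hz):
    """Frequency at which a sampled tone appears, folded into 0..Fs/2."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

print(min_sample_rate(20_000))          # 40000
print(alias_frequency(1_000, 44_100))   # 1000, below Fs/2 so unchanged
print(alias_frequency(30_000, 44_100))  # 14100, folded back below Fs/2
```

Both hi-fi rates quoted above (44.1 kHz and 48 kHz) exceed the 40 kHz minimum for an assumed 20 kHz bandwidth.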
Quantisation levels
Each sample is quantised to be represented by a binary integer
The number of bits used to represent each sample sets the
number of quantisation levels
The error between the quantised signal and the original audio is
the quantisation noise
Peak signal-to-quantisation noise ratio using n bits per sample
can be estimated as:
$\mathrm{SNR} \approx 6n \ \mathrm{dB}$
CD audio uses 16 bit resolution giving a dynamic range of
~96 dB
To hear the quantisation noise, the signal level would be close
to the threshold of pain!
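The 6n dB estimate can be checked numerically. The sketch below quantises a full-scale sine wave and measures the resulting signal-to-noise ratio; the 997 Hz test tone, the 48 kHz rate and the function name are assumptions for the example. For a sine wave the commonly quoted figure is 6.02n + 1.76 dB, so the measured values should sit slightly above 6n.

```python
import math

def measured_snr_db(bits, sample_rate=48_000, tone_hz=997, count=48_000):
    """Quantise a full-scale sine to `bits` bits and measure the SNR in dB."""
    peak = 2 ** (bits - 1) - 1          # largest positive integer code
    signal_power = 0.0
    noise_power = 0.0
    for i in range(count):
        x = peak * math.sin(2 * math.pi * tone_hz * i / sample_rate)
        q = round(x)                    # quantise to the nearest integer level
        signal_power += x * x
        noise_power += (x - q) ** 2     # quantisation error energy
    return 10 * math.log10(signal_power / noise_power)

for bits in (8, 12, 16):
    print(bits, "bits:", round(measured_snr_db(bits), 1), "dB; 6n =", 6 * bits, "dB")
```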
Sub-band Coding
Like the eye, the ear is more sensitive to some frequencies than
others
Many audio coding algorithms exploit this using a form of sub-
band coding
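[Figure: sub-band coder block diagram: Digital audio in, Filters, Down-sample, Quantise, Multiplex, Coded audio out]
Bit rates:
16 x 48000 = 768 kbps (digital audio in)
16 x 3 x 48000 = 2304 kbps (after the filters)
16 x 3 x 16000 = 768 kbps (after down-sampling)
4 x 3 x 16000 = 192 kbps (after quantisation)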
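The bit rates above follow from bits per sample x number of bands x sample rate. The sketch below reproduces them, reading the figures as a 3-band coder that down-samples each band by 3 and requantises to 4 bits; the stage names are an interpretation of the diagram, not labels taken from it.

```python
def bit_rate(bits_per_sample, bands, sample_rate_hz):
    """Total bit rate in bits per second across all bands."""
    return bits_per_sample * bands * sample_rate_hz

stages = [
    ("digital audio in", bit_rate(16, 1, 48_000)),
    ("after 3-band filters", bit_rate(16, 3, 48_000)),
    ("after down-sampling by 3", bit_rate(16, 3, 16_000)),
    ("after 4-bit quantisation", bit_rate(4, 3, 16_000)),
]
for name, rate in stages:
    print(f"{name}: {rate / 1000:g} kbps")
```

Filtering alone triples the data; down-sampling restores the original rate, so all of the saving comes from the coarser quantisation.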
Perceptual Coding
A key question when designing a sub-band coder:
What should the quantisation levels of the sub-bands be?
Remember that the quantisation process will
introduce noise and that we want the noise to be
imperceptible
We want the noise to be just below the threshold of
hearing (also known as the Minimum Audible Field,
MAF)
So, the question should be:
What is the MAF in each sub-band?
To estimate this, look at Robinson-Dadson curves
Equal Loudness Curves
Quantisation Implications
[Figure: Sound Pressure Level [dB-SPL] (-30 to 80) vs. Frequency [Hz] (up to 15000), showing the peak signal level, the threshold of hearing and the quantisation noise for 16 bits and 12 bits]
Application to Sub-band Coding
[Figure: same axes, showing the peak signal level and the threshold of hearing, with bits per sub-band: 9, 9, 10, 10, 10, 9, 10, 11, 12, 11, 12, 12]
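One way to read the sub-band allocation: each band needs just enough bits to push the quantisation noise down to the threshold of hearing, at roughly 6 dB per bit. The sketch below applies that rule to hypothetical levels (not read from the plot) and then averages the 12 allocations listed on the slide; `bits_needed` is a name invented for this example.

```python
import math

def bits_needed(peak_db_spl, threshold_db_spl, db_per_bit=6.0):
    """Smallest bit count whose quantisation noise is at or below the threshold."""
    headroom_db = peak_db_spl - threshold_db_spl
    return max(0, math.ceil(headroom_db / db_per_bit))

# Hypothetical levels for illustration only (not read from the slide's plot).
print(bits_needed(peak_db_spl=70, threshold_db_spl=5))   # 11
print(bits_needed(peak_db_spl=70, threshold_db_spl=-2))  # 12

# Allocation listed on the slide for its 12 sub-bands.
slide_bits = [9, 9, 10, 10, 10, 9, 10, 11, 12, 11, 12, 12]
average = sum(slide_bits) / len(slide_bits)
print(round(average, 2), round(16 / average, 2))          # 10.42 1.54
```

With the static threshold of hearing alone, the saving over 16-bit samples is modest (about 1.5:1 for these figures).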
Psychoacoustics
Substantial improvements to our sub-band
coder are possible using psychoacoustics
Psychoacoustics is the study of how sound is
perceived by the ear-brain combination
Of interest to us: how the threshold of
hearing is not constant
In fact, the threshold of hearing constantly
changes due to masking
Masking
In the presence of the signal, the noise sounds much
quieter (almost undetectable)
Due to the anatomy of the ear, loud sounds mask
quieter sounds at nearby frequencies
Effectively, the threshold of hearing is raised to the
masking threshold
The masking threshold can be estimated using a
psychoacoustic model and exploited by the coder
[Audio examples: signal; signal + noise (SNR = 24 dB); noise]
The Masking Threshold
[Figure: Sound Pressure Level [dB-SPL] (-30 to 80) vs. Frequency [Hz] (up to 15000), showing the signal, the threshold of hearing and the masking threshold]
Applying Masking
[Figure: same axes, showing the threshold of hearing and the masking threshold, with bits per sub-band: 2, 4, 4, 4, 3, 2, 4, 4, 5, 5, 5, 5. Frame used for example: Space Oddity, Bowie]
Average bits per sample = 3.92
Compression ratio = 16:3.92 = 4.1:1
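The average and compression ratio quoted above can be reproduced from the 12 per-sub-band allocations on the slide. The helper `bits_for_smr` is an assumed rule of thumb (about 6 dB of signal-to-mask ratio per bit), included only to show why a 24 dB ratio corresponds to 4 bits.

```python
import math

def bits_for_smr(smr_db, db_per_bit=6.0):
    """Bits needed so quantisation noise sits at or below the masking threshold."""
    return max(0, math.ceil(smr_db / db_per_bit))

print(bits_for_smr(24))   # 4 bits for a 24 dB signal-to-mask ratio

# Bits per sub-band from the slide (12 sub-bands).
masked_bits = [2, 4, 4, 4, 3, 2, 4, 4, 5, 5, 5, 5]

average_bits = sum(masked_bits) / len(masked_bits)
compression_ratio = 16 / average_bits   # relative to 16-bit samples

print(f"Average bits per sample = {average_bits:.2f}")   # 3.92
print(f"Compression ratio = {compression_ratio:.1f}:1")  # 4.1:1
```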
Additional Side Information
The audio signal is processed in discrete blocks of
samples known as frames
Each frame of each sub-band is:
Scaled to normalise the peak signal level
Quantised at a level appropriate for the current signal-to-
mask ratio
The receiver needs to know the scale factor and
quantisation levels used
This information must be embedded along with the
samples
The resulting overhead is very small compared with
the compression gains
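A minimal sketch of the scale-then-quantise step and the side information it produces. This is a simplified stand-in, not the MPEG procedure: the scale factor here is the raw frame peak, the sample values are made up, and the function names are invented for the example.

```python
def encode_frame(samples, bits):
    """Scale a frame to unit peak, then quantise each sample to `bits` bits.

    Assumes bits >= 2 so that at least one non-zero level exists.
    """
    scale = max(abs(s) for s in samples) or 1.0   # avoid dividing by zero
    levels = 2 ** (bits - 1) - 1                  # largest positive code
    codes = [round(s / scale * levels) for s in samples]
    return scale, bits, codes                     # scale and bits are side info

def decode_frame(scale, bits, codes):
    """Undo the quantisation and scaling using the side information."""
    levels = 2 ** (bits - 1) - 1
    return [c / levels * scale for c in codes]

frame = [0.02, -0.05, 0.10, 0.07, -0.01, 0.00, 0.04, -0.08, 0.03, 0.06, -0.02, 0.01]
scale, bits, codes = encode_frame(frame, bits=4)
decoded = decode_frame(scale, bits, codes)
print(scale, codes)
print(max(abs(a - b) for a, b in zip(frame, decoded)))
```

The receiver cannot reconstruct the samples from `codes` alone; it needs `scale` and `bits` as well, which is the side information the slide refers to.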
Block Diagrams
[Figure: ENCODER block diagram. Main path: Digital Audio In, Sub-band filter bank, Scale and Quantise, Multiplex and Data Format, Coded Audio Out. Side chain: FFT, Psychoacoustic model (masking thresholds), Code Side Info]
[Figure: DECODER block diagram. Main path: Coded Audio In, De-Multiplex, Descale & Dequantise, Inverse filter bank, Digital Audio Out. Side chain: Decode Side Info]
MPEG-1: Layers 1, 2 & 3
Three perceptual coders are available in the MPEG-1
specification
They are known as layers 1, 2 & 3
Layer 1 (.mp1)
Similar to the simple coder just described
32 sub-bands are used
Each frame contains 384 samples (32 x 12)
A version of layer 1 was used in the Digital Compact
Cassette (DCC)
Layer 2 (.mp2)
Slightly more complex but better quality than layer 1
Frame length increased to 1152 samples (32 x 36)
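The frame sizes quoted for layers 1 and 2 translate into frame durations once a sample rate is chosen. The sketch below uses the two hi-fi rates mentioned earlier; the function name is an assumption for the example.

```python
def frame_duration_ms(samples_per_frame, sample_rate_hz):
    """Length of one frame in milliseconds."""
    return 1000 * samples_per_frame / sample_rate_hz

SUB_BANDS = 32
layer1 = SUB_BANDS * 12   # 384 samples per frame
layer2 = SUB_BANDS * 36   # 1152 samples per frame

for rate in (44_100, 48_000):
    print(rate,
          round(frame_duration_ms(layer1, rate), 2),
          round(frame_duration_ms(layer2, rate), 2))
```

At 48 kHz this gives 8 ms frames for layer 1 and 24 ms frames for layer 2.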
MPEG-1: Layers 1, 2 & 3 (cont)
Layer 2 (cont)
Data formatting of samples and side information is slightly
more efficient
Used in Digital Audio Broadcasting (DAB)
Layer 3 (.mp3)
Significantly more complex than layers 1 or 2
Capable of reasonable quality even at very low data rates
A combination of sub-band coding and transform coding is
used to give up to 576 frequency bands (compared to 32 for
layers 1 & 2)
Huffman encoding is applied to samples
MP3 files are now hugely popular with internet and mobile users
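To show the idea behind applying Huffman encoding to samples, the sketch below builds a Huffman code from a short run of quantised values. This is a generic textbook construction under stated assumptions: Layer 3 itself selects from predefined tables rather than building a code per signal, and the sample values here are hypothetical.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    counts = Counter(symbols)
    if len(counts) == 1:                      # degenerate case: one symbol
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, low = heapq.heappop(heap)      # two least frequent subtrees
        w1, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]

# Hypothetical quantised samples in which small values dominate.
samples = [0, 0, 0, 1, 0, -1, 0, 0, 2, 0, 1, 0, 0, -1, 0, 0]
code = huffman_code(samples)
encoded = "".join(code[s] for s in samples)
print(code)
print(len(encoded), "bits, versus", 2 * len(samples), "bits at a fixed 2 bits per sample")
```

Frequent values receive short codes, so the 16 samples take 24 bits instead of 32.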
Other Perceptual Coders
The same principles are applied in subtly
different ways in most general-purpose audio
coders
E.g.
Real Audio
Microsoft's WMA format
MiniDisc (ATRAC)
MPEG-4
In the latest version of MPEG, MPEG-4, the
specification includes:
General audio coders: Similar to MPEG-1 but including
multi-channel support
Parametric coder: HILN (Harmonics, Individual Lines and
Noise) for very low bit rates
Speech coders: HVXC and CELP speech coders
Structured Audio: Similar to MIDI but including instrument
models. Used for synthetic audio.
Synthesised Speech: Allows speech to be coded as text and
resynthesised at the decoder
Summary
Standard MIDI files
Work by encoding the structure of the music
MPEG-1 Layers 1 & 2
Work by removing the perceptual redundancy from digitised
audio
MPEG-1 Layer 3
Removes perceptual redundancy and statistical redundancy
(by entropy coding)
MPEG-4
Coding method can be chosen to suit signal source
Perceptual, statistical and structural redundancy can be
exploited
