
Digital Audio Compression

4/5/2004

Nguyen Chan Hung - Hanoi University of Technology

MPEG Audio: Specifications

MPEG-1 (ISO/IEC 11172-3) provides:


Single-channel ('mono') and two-channel ('stereo' or 'dual mono')
coding of digitized sound waves at 32, 44.1, and 48 kHz
sampling rates.
The predefined bit-rates range from 32 to 448 kbit/s for Layer I,
from 32 to 384 kbit/s for Layer II, and from 32 to 320 kbit/s for
Layer III.
MPEG-2 BC (ISO/IEC 13818-3) provides:
A backwards compatible (BC) multi-channel extension to
MPEG-1


Up to 5 main channels plus a 'low-frequency enhancement' (LFE)
channel can be coded.
The bit-rate range is extended up to about 1 Mbit/s.

An extension of MPEG-1 towards lower sampling rates (16,
22.05, and 24 kHz) for bitrates from 32 to 256 kbit/s (Layer I)
and from 8 to 160 kbit/s (Layer II & Layer III).

MPEG Audio: Specifications (2)

MPEG-2 AAC (ISO/IEC 13818-7) provides:

A very high-quality audio coding standard for 1 to 48 channels at sampling
rates of 8 to 96 kHz, with multichannel, multilingual, and multiprogram
capabilities.
AAC works at bitrates from 8 kbit/s for a monophonic speech signal up to
in excess of 160 kbit/s per channel for very-high-quality coding that
permits multiple encode/decode cycles.
Three profiles of AAC provide varying levels of complexity and scalability.

MPEG-4 (ISO/IEC 14496-3) provides:

Coding and composition of natural and synthetic audio objects,
Scalability of the bitrate of an audio bitstream,
Scalability of encoder or decoder complexity,
Structured Audio: a universal language for score-driven sound synthesis,
TTSI: an interface for text-to-speech conversion systems.

MPEG-7 (ISO/IEC 15938) will provide


Standardized descriptions and description schemes of audio structures
and sound content,
A language to specify such descriptions and description schemes.


Related specifications

MUSICAM

Masking pattern adapted Universal Sub-band Integrated
Coding And Multiplexing
Designed to be suitable for DAB (Digital Audio Broadcasting)

ASPEC

Adaptive Spectral Perceptual Entropy Coding
Designed for high degrees of compression to allow audio
transmission on ISDN

NICAM 728

Used for European PAL television audio

Dolby AC-3

Designed for ATSC Digital TV



Background of audio compression

Audio compression takes advantage of two facts:

First, in typical audio signals, not all frequencies are
simultaneously present.
Second, because of the phenomenon of masking, human
hearing cannot perceive every detail of an audio signal.

Audio compression splits the audio spectrum into bands by
filtering or transforms, and includes less data when describing
bands in which the level is low.
Where masking prevents or reduces audibility of a particular
band, even less data needs to be sent.


Background of audio compression (2)

Audio compression is relatively harder than video compression
because of the acuity of hearing.

1. Masking:
Masking only works properly when the masking and the masked
sounds coincide spatially.
Spatial coincidence is always the case in mono recordings but
not in stereo recordings, where low-level signals can still be
heard if they are in a different part of the soundstage.
Consequently, in stereo and surround-sound systems, a lower
compression factor is allowable for a given quality.

2. Loudspeaker quality:
Delayed resonances in poor loudspeakers actually mask
compression artifacts.
Testing a compressor with poor speakers gives a false result:
signals that are apparently satisfactory may be disappointing
when heard on good equipment.


The characteristics of human hearing

The top figure shows that the threshold of hearing is a function
of frequency.
Naturally, the greatest sensitivity is in the speech range.
The bottom figure shows the hearing threshold in the presence
of a single tone.

Note that the threshold is raised for tones at higher frequencies,
and to some extent at lower frequencies (the masking effect).

A complex input spectrum, such as music, raises the threshold at
nearly all frequencies.
As a result, the hiss from an analog audio cassette is only audible
during quiet passages in music.


The characteristics of human hearing (2)

A sound must be present for at least about 1 millisecond before
it becomes audible.
Because of this slow response, masking can still take place even
when the two signals involved are not simultaneous.
Forward and backward masking occur when the masking sound
continues to mask sounds at lower levels before and after the
masking sound's actual duration.


Masking

Masking raises the threshold of hearing; compressors take
advantage of this effect by raising the noise floor, which allows
the audio waveform to be expressed with fewer bits.
The noise floor can only be raised at frequencies at which there
is effective masking.
To maximize effective masking, it is necessary to split the audio
spectrum into different frequency bands to allow introduction of
different amounts of companding and noise in each band.


MPEG Audio: General encoder model


[Block diagram: Input -> Sub-band Filter -> Bit Allocation -> Bit-stream
Generation -> Output, with a Compute Masking block driving the Bit
Allocation.]


MPEG Audio encoding algorithm

Use sub-band filters to divide the audio signal into 32 frequency
sub-bands that approximate the 32 critical bands.
Determine the amount of masking for each band caused by
nearby bands using the psychoacoustic model.
If the power in a band is below the masking threshold, ignore it.
Otherwise, determine the number of bits needed to represent the
coefficient such that the noise introduced by quantization is
below the masking effect (1 bit ≈ 6 dB).
Generate the bitstream.
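As a rough illustration of that per-band decision (not from the slides; a minimal sketch assuming a uniform quantizer gains roughly 6 dB of signal-to-noise ratio per bit):

```python
import math

def bits_for_band(level_db: float, mask_db: float, max_bits: int = 15) -> int:
    """Illustrative only: 0 bits if the band is fully masked, otherwise just
    enough bits that quantization noise (~6 dB per bit) stays below the mask."""
    if level_db <= mask_db:
        return 0                               # inaudible band: send nothing
    snr_needed = level_db - mask_db            # noise must sit below the mask
    return min(math.ceil(snr_needed / 6), max_bits)

# A 35 dB band masked at 15 dB needs ceil(20 / 6) = 4 bits.
print(bits_for_band(35, 15))
```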


MPEG Audio: Coding example


After analysis, the levels of the first 16 of the 32 bands are:

Band         1    2    3    4    …    7    8    9   10   11    …   16
Level (dB)   0    8   12   10    …   10   60   35   20   15    …


The level of the 8th band is 60 dB. If it gives a masking of 12 dB
in the 7th band and 15 dB in the 9th band, then:

Level in the 7th band is 10 dB ( < 12 dB ), so ignore it.
Level in the 9th band is 35 dB ( > 15 dB ), so encode it.
Can encode with up to 2 bits (= 12 dB) of quantization error.


Sub-band coding (SBC) - Companding

The Figure shows a band-splitting compandor.


The band-splitting filter is a set of narrow-band,
linear-phase filters that overlap and all have the
same bandwidth.
The output in each band consists of samples
representing a waveform.
In each frequency band, the audio input is
amplified up to maximum level prior to
transmission.
Afterwards, each level is returned to its correct
value.
Noise picked up in the transmission is reduced in
each band.
If the noise reduction is compared with the
threshold of hearing, it can be seen that greater
noise can be tolerated in some bands because
of masking.
Consequently, in each band after companding,
it is possible to reduce the wordlength of
samples.
This technique achieves a compression
because the noise introduced by the loss of
resolution is masked.


MPEG Audio Layer I

The figure shows a simplified band-splitting coder used in MPEG Layer I.

The digital audio input is fed to a band-splitting filter that divides the
spectrum of the signal into a number of bands (32 bands).
The time axis is divided into blocks of equal length.
In MPEG Layer I, this is 384 input samples, so in the output of the filter
there are 12 samples in each of the 32 bands.
Within each band, the level is amplified by multiplication to bring the
level up to maximum.
The gain required is constant for the duration of a block.
A single scale factor is transmitted with each block for each band in
order to allow the process to be reversed at the decoder.
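To make the 12-sample block and scale factor idea concrete, here is a minimal sketch (not from the slides; the array shape and normalization by the block peak are illustrative assumptions, whereas the real coder picks the scale factor from a fixed table and sends a 6-bit index):

```python
import numpy as np

def scale_blocks(subband_samples: np.ndarray):
    """subband_samples: shape (32, 12) - one Layer I block: 12 consecutive
    samples from each of the 32 sub-bands. Returns (scale_factors, normalized),
    where each band is divided by its block peak so samples fit [-1, 1]."""
    peaks = np.abs(subband_samples).max(axis=1)    # one value per band
    peaks = np.where(peaks == 0.0, 1.0, peaks)     # avoid division by zero
    return peaks, subband_samples / peaks[:, None]

# The decoder multiplies by the transmitted scale factor to restore the level.
sf, norm = scale_blocks(np.random.randn(32, 12))
```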


MPEG Audio Layer I (cont.)

The filter bank output is also analyzed to determine the


spectrum of the input signal.
This analysis drives a masking model that determines the
degree of masking that can be expected in each band.
The more masking available, the less accurate the samples in
each band can be.
The sample accuracy is reduced by requantizing to reduce
wordlength.
This reduction is also constant for every word in a band, but
different bands can use different wordlengths.
The wordlength needs to be transmitted as a bit allocation
code for each band to allow the decoder to deserialize the bit
stream properly.


MPEG Layer I audio bit stream

The top figure shows an MPEG Layer I audio bit stream, which includes:

Synchronizing pattern and the header.
32 bit allocation codes of four bits each.
These codes describe the wordlength of samples in each sub-band.
32 scale factors used in the companding of each band.
These scale factors determine the gain needed in the decoder to
return the audio to the correct level.
Audio data in each band.
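A rough bit-budget tally for one such Layer I frame (not from the slides; the 32-bit header, optional 16-bit CRC and 6-bit scale factors are assumptions based on the usual layout):

```python
def layer1_frame_bits(wordlengths, crc=False):
    """wordlengths: 32 per-band sample wordlengths (0 = band not transmitted)."""
    assert len(wordlengths) == 32
    bits = 32 + (16 if crc else 0)             # header (+ optional CRC)
    bits += 32 * 4                             # one 4-bit allocation code per band
    bits += 32 * 6                             # 32 scale factors of 6 bits each
                                               # (the real codec skips empty bands)
    bits += sum(12 * b for b in wordlengths)   # 12 samples per band per frame
    return bits

# e.g. keeping 8 bands at 8 bits each: 32 + 128 + 192 + 768 = 1120 bits
print(layer1_frame_bits([8] * 8 + [0] * 24))
```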



MPEG Layer I decoder

The synchronization pattern is detected by the timing generator,
which deserializes the bit allocation and scale factor data.
The bit allocation data then allows deserialization of the
variable-length samples.
The requantizing is reversed, and the companding is reversed by
the scale factor data to put each band back to the correct level.
These 32 separate bands are then combined in a combiner filter,
which produces the audio output.
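A minimal per-band sketch of those last two steps (illustrative only; the real dequantizer uses tables from the standard, for which this simple midrise mapping stands in):

```python
def decode_band(codes, wordlength, scale_factor):
    """Reverse requantization and companding for one sub-band block.
    codes: 12 unsigned integers of `wordlength` bits each."""
    steps = 1 << wordlength                    # number of quantizer levels
    # map each code back into (-1, 1), then restore the original band level
    return [((2 * c + 1) / steps - 1.0) * scale_factor for c in codes]

# The 32 reconstructed bands then feed the synthesis (combiner) filter bank.
print(decode_band([0, 3, 7], 3, 0.5))
```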


MPEG Audio: The concept of Layers


Compression ratios (original bitrate is 1.4 Mbps for CD-quality audio):

1:4          by Layer I   (corresponds to 384 kbps for a stereo signal),
1:6...1:8    by Layer II  (corresponds to 256...192 kbps for a stereo signal),
1:10...1:12  by Layer III (corresponds to 128...112 kbps for a stereo signal).

Three layers in MPEG audio: Layer I, II, III

Basic model is similar.
CODEC complexity increases with each layer.
A decoder for a higher layer can decode streams of the lower layers
(e.g. a Layer III decoder can decode a Layer II stream, etc.).
Psychoacoustic model is used to determine bit allocation to each
subband.
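A quick check of where those ratios come from (not from the slides; it assumes the 1.4 Mbps source is 16-bit stereo at 44.1 kHz, i.e. about 1.41 Mbit/s):

```python
cd_bitrate = 2 * 16 * 44100                        # ~1.41 Mbit/s for CD audio
for layer, kbps in [("Layer I", 384), ("Layer II", 192), ("Layer III", 128)]:
    print(f"{layer}: about 1:{cd_bitrate / (kbps * 1000):.0f}")
# -> Layer I: about 1:4, Layer II: about 1:7, Layer III: about 1:11
```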


MPEG Audio: Filter type

Layer I: DCT-type filter with one frame and equal frequency
spread per band.
Psychoacoustic model only uses frequency masking.

Layer II: Uses three frames in the filter (a total of 1152 samples).
Psychoacoustic model adds a little bit of temporal masking.

Layer III: Better critical-band filter is used (non-equal frequencies).
Psychoacoustic model includes temporal masking effects.
Takes into account stereo redundancy.
Uses Huffman coder.

MPEG-1 Audio Encoder (Layer I & II)


[Block diagram: PCM input -> Analysis filter bank (32 subbands, 0 to 31) ->
Scaler -> Quantizer -> Quantized sample encoder -> Multiplexer -> Output.
The Psychoacoustic model produces SMRn for the Bit-rate allocation block,
which supplies Rn to the quantizer; the Scale factor encoder (SFn) and the
Bit-rate allocation encoder also feed the Multiplexer.
Legend: SF = Scale factor, R = Rate, SMR = Signal-to-Mask Ratio.]


MPEG-1 Audio Encoder (cont.)

The input audio stream passes through a filter bank that divides
the input into multiple subbands of frequency.
The input audio stream simultaneously passes through a
psychoacoustic model that determines the ratio of the signal
energy to the masking threshold for each subband.
The bit- or noise allocation block uses the Signal-to-Mask
Ratios to decide how to apportion the total number of code
bits available for the quantization of the subband signals to
minimize the audibility of the quantization noise.
Finally, the multiplexer takes the representation of the quantized
subband samples and formats this data and side information into
a coded bitstream.
Ancillary data not necessarily related to the audio stream can
be inserted within the coded bitstream.
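One common way to realize that apportioning step is a greedy loop that keeps giving a bit to whichever sub-band's quantization noise is currently most audible; this is only a sketch of the idea, not the procedure the slides (or the standard's tables) prescribe:

```python
def allocate_bits(smr_db, total_bits, max_bits=15, db_per_bit=6.0):
    """smr_db: signal-to-mask ratio per sub-band in dB (from the model)."""
    alloc = [0] * len(smr_db)
    for _ in range(total_bits):
        # noise-to-mask ratio: positive means quantization noise is audible
        nmr = [s - db_per_bit * a for s, a in zip(smr_db, alloc)]
        open_bands = [b for b in range(len(alloc)) if alloc[b] < max_bits]
        if not open_bands:
            break
        worst = max(open_bands, key=lambda b: nmr[b])
        if nmr[worst] <= 0:
            break                              # all noise is already masked
        alloc[worst] += 1
    return alloc

print(allocate_bits([30, 12, 0, 20], 100))     # -> [5, 2, 0, 4]
```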


MPEG Audio: Subband sample grouping


[Figure: the audio samples feed the 32 sub-band filters; each filter outputs
groups of 12 samples. A Layer I frame spans one group of 12 samples per
sub-band; a Layer II or III frame spans three such groups per sub-band.]

Layer I: 12 * 32 = 384 samples,
Layer II, III: 12 * 3 * 32 = 1152 samples


Psychoacoustic model: Layer I & II


[Block diagram: the PCM input is analyzed by an FFT (512 or 1024 frequencies).
From the spectrum, the signal power Sn per sub-band is computed and a
tonal/non-tonal separator splits the components; tonal and non-tonal masking
threshold functions are computed and combined with the quiet threshold into
the masking threshold function. The minimum masking level Mn in each sub-band
then gives SMRn.]

The separator identifies and separates the tonal and noiselike components (non-tonal) of the audio signal because the
masking abilities of the two types of signal differ.
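Condensing the last stage of that pipeline into code (illustrative only; a real model works per critical band with spreading functions, which this per-band minimum glosses over):

```python
import numpy as np

def smr_for_band(signal_power_db, masking_db, quiet_db):
    """Inputs: per-frequency-line levels (dB) falling inside one sub-band.
    The effective mask is the larger of the masking curve and the quiet
    threshold; SMRn compares the band's peak against the weakest mask."""
    mask = np.maximum(masking_db, quiet_db)
    Sn = signal_power_db.max()                 # band signal level
    Mn = mask.min()                            # minimum masking in the band
    return Sn - Mn                             # SMRn, passed to bit allocation

print(smr_for_band(np.array([55.0, 62.0, 58.0]),
                   np.array([45.0, 48.0, 40.0]),
                   np.array([20.0, 20.0, 20.0])))   # -> 22.0 dB
```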


MPEG-1 Audio Layer III Encoder (mp3)


[Block diagram: PCM input -> Analysis filter bank (32 subbands, 0 to 31) ->
MDCT (sub-subbands) -> Scaler (scale factors) -> Quantizer -> Huffman encoding
of the quantized samples -> Buffer -> Multiplexer -> Output. The Psychoacoustic
model (SMRn) drives a block that calculates window sizes, scale factor bands,
bit-rate allocation and quantization, taking buffer fullness into account;
scale factors and side information are encoded and multiplexed into the
stream.]

Frame formats of 3 layers


Layer I:   Header (32) | CRC (0,16) | Bit Allocation (128,256) | Scale factors (0-384) | Samples | Ancillary data

Layer II:  Header (32) | CRC (0,16) | Bit Allocation (128,256) | SCFSI (0-60) | Scale factors (0-384) | Samples | Ancillary data

Layer III: Header (32) | CRC (0,16) | Side information (136,256) | Main Data (may belong to other frames) | Ancillary data

(Field sizes in bits.)

SCFSI = Scale Factor Selection Information
Side information of an mp3 frame = 17 bytes (136 bits) in single-channel
mode and 32 bytes (256 bits) in dual-channel mode.
CRC is optional.
While a Layer I frame contains only 384 samples, Layer II and Layer III
frames contain 1152 samples.
Main data of mp3 may contain data of neighboring frames (see next slides).
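As a side note (not from the slides; these are the commonly quoted MPEG-1 frame-length formulas), the byte length of each frame follows directly from the bitrate and sampling rate carried in the header:

```python
def frame_length_bytes(layer, bitrate_bps, sample_rate_hz, padding=0):
    """Commonly used MPEG-1 audio frame lengths, in bytes per frame."""
    if layer == 1:                             # 384 samples/frame, 4-byte slots
        return (12 * bitrate_bps // sample_rate_hz + padding) * 4
    return 144 * bitrate_bps // sample_rate_hz + padding   # Layers II and III

print(frame_length_bytes(3, 128_000, 44_100))  # a 128 kbps, 44.1 kHz mp3: 417
```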


MP3 frame

The main data section contains the coded scale factor values
and the Huffman coded frequency lines
Its length depends on the bitrate and the length of the ancillary
data.
The length of the scale factor part depends on whether scale
factors are reused, and also on the window length (short or long).
The scale factors are used in the requantization of the
samples
The demand for Huffman code bits varies with time during the
coding process.
The variable bitrate format can be used to handle this, but a fixed
bitrate is often required for an application such as broadcasting
Therefore there is also a bit reservoir technique that allows
unused main data storage in one frame to be used by up to
two consecutive frames


MP3 frame Bit Reservoir

The design of the Layer III bitstream better fits the encoder's time
varying demand on code bits.
As with Layer II, Layer III processes the audio data in frames of
1,152 samples.
Unlike Layer II, the coded data representing these samples do not
necessarily fit into a fixed length frame in the code bitstream.
The encoder can donate bits to a reservoir when it needs fewer
than the average number of bits to code a frame.


MP3 frame Bit Reservoir (2)

Later, when the encoder needs more than


the average number of bits to code a frame,
it can borrow bits from the reservoir.
The encoder can only borrow bits donated
from past frames; it cannot borrow from
future frames.
MP3 bitstream includes a 9-bit pointer,
"main_data_begin," with each frame's side
information pointing to the location of the
starting byte of the audio data for that frame.
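A toy view of the encoder-side bookkeeping described above (illustrative; everything here except the main_data_begin name and its 9-bit range is an assumption made for the sketch):

```python
class BitReservoir:
    """Tracks bytes donated by past frames that later frames may borrow."""
    MAX_POINTER = 511                  # main_data_begin is a 9-bit byte offset

    def __init__(self):
        self.stored = 0                # donated, still-unused bytes

    def code_frame(self, frame_budget_bytes, needed_bytes):
        """Returns main_data_begin: how far back this frame's main data starts."""
        borrow = max(0, needed_bytes - frame_budget_bytes)
        borrow = min(borrow, self.stored, self.MAX_POINTER)   # past frames only
        leftover = max(0, frame_budget_bytes - needed_bytes)
        self.stored = min(self.stored - borrow + leftover, self.MAX_POINTER)
        return borrow

reservoir = BitReservoir()
print(reservoir.code_frame(417, 300))  # easy frame: donates 117 bytes -> 0
print(reservoir.code_frame(417, 500))  # hard frame: borrows 83 bytes -> 83
```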


MP3: Hybrid frequency analysis

Purpose

Increase the frequency resolution in subbands for better
perceptual coding.
Allow for some cancellation of aliasing caused by the
polyphase analysis subband filters.

MDCT (Modified Discrete Cosine Transform)

50% overlapped transform.
Short-window MDCT: 6 sub-subbands (12-point DCT) in
each subband. Better time resolution.
Long-window MDCT: 18 sub-subbands (36-point DCT) in
each subband. Better frequency resolution.
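For reference, a plain (unwindowed) MDCT sketch matching those sizes; this is not from the slides, and the real Layer III coder applies sine or block-switching windows before this step. A 2N-sample input yields N outputs, so 36 samples give 18 sub-subbands and 12 samples give 6:

```python
import numpy as np

def mdct(x):
    """Modified DCT: 2N overlapping input samples -> N spectral values."""
    N = len(x) // 2
    n = np.arange(2 * N)
    k = np.arange(N)[:, None]
    basis = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return basis @ x

print(mdct(np.random.randn(36)).shape)   # long window:  (18,)
print(mdct(np.random.randn(12)).shape)   # short window: (6,)
```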


MP3 Decoder


MP3 Performance
Sound quality     Bandwidth   Mode     Bitrate          Reduction ratio
Telephone sound   2.5 kHz     mono     8 kbps *         96:1
Short wave        4.5 kHz     mono     16 kbps          48:1
AM radio          7.5 kHz     mono     32 kbps          24:1
FM radio          11 kHz      stereo   56...64 kbps     26...24:1
Near-CD           15 kHz      stereo   96 kbps          16:1
CD                >15 kHz     stereo   112...128 kbps   14...12:1


MPEG-2 Audio

Differences between MPEG-1 and MPEG-2 audio for two-channel stereo:

Initial PCM sampling rate extends to include 16, 22.05, and 24 kHz.
Pre-assigned bitrates are extended to as low as 8 kbit/s.
Better quantization tables are provided for the lower rates.
The coding efficiency of the coding of scale_factor and
intensity_mode stereo in Layer III is improved.

MPEG-2 Audio: Backward Compatible (BC)

Defines a five-channel surround sound:

Front left (L), front right (R), front center (C), side/rear left (LS),
side/rear right (RS), and an optional low-frequency enhancement (LFE)
channel.
3/2 stereo: L, R, C, LS, RS
5.1-channel stereo: L, R, C, LS, RS, LFE

An MPEG-1 decoder can decode the L and R signals.

Coding method:

L and R channels are coded as in MPEG-1.
Additional channels are coded as ancillary data in the MPEG-1
audio stream.


MPEG-2 Audio frame


[Figure: an MPEG-1 frame — Header | CRC | Bit Allocation | SCFSI |
Scale factor | Samples — whose ancillary-data region carries the
multi-channel extension: Header | CRC | Bit Allocation | SCFSI | Predictor |
Ancillary data 1 | MC Samples | Ancillary data 2, together with
multi-lingual commentary.
MC = Multi-Channel audio data information.]

As can be seen in the figure, the MPEG-2 audio frame is an extension of
the MPEG-1 frame that supports multi-channel and multi-lingual audio.


MPEG-2 BC and MPEG-1 compatibility


[Figure: MPEG-1 covers mono & stereo at 32, 44.1, 48 kHz with Layers I, II
and III. MPEG-2 extends this in two directions: a low-frequency extension
(mono & stereo at 16, 22.05, 24 kHz, Layers I, II, III) and a multi-channel
extension (5 channels at 32, 44.1, 48 kHz, Layers I, II, III).]

MPEG-2 Audio: Advanced Audio Coding (AAC)

To further improve the quality of compressed audio using
state-of-the-art technologies.
It was designated as MPEG-2 NBC (Non Backward Compatible).
Initial PCM sampling rate: 8 kHz to 96 kHz.
Supports from mono up to 48 audio channels.


Key Points

MPEG Audio Specifications


MPEG Audio mechanism

MPEG-1 Audio encoding/decoding

Human hearing & Audio masking


Sub-band coding (SBC) mechanism
Psychoacoustic model
The concept of layers
Layer I / Layer II / Layer III differences

MPEG-2 Audio BC
MPEG-2 AAC (NBC)

