
Audio Compression Techniques
Introduction
 Digital Audio Compression
 Removal of redundant or otherwise irrelevant
information from an audio signal
 Audio compression algorithms are often referred to as
“audio encoders”
 Applications
 Reduces required storage space
 Reduces required transmission bandwidth

Audio Compression
 Audio signal – overview
 Sampling rate (# of samples per second)
 Bit rate (# of bits per second); typically, an
uncompressed stereo 16-bit 44.1 kHz signal has a
bit rate of about 1.4 Mbps
 Number of channels (mono / stereo / multichannel)
 Reduction by lowering those values or by data
compression / encoding

Why Compression is Needed
 Data rate = sampling rate * quantization
bits * channels (+ control information)

 For example (digital audio):
 44,100 Hz; 16 bits; 2 channels
 generates about 1.4 Mbit of data per second,
84 Mbit per minute, and 5 Gbit per hour (checked below)
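
A quick sanity check of those figures, as a minimal Python sketch:

# Data rate = sampling rate * quantization bits * channels
sampling_rate = 44_100          # samples per second
bits_per_sample = 16
channels = 2

bits_per_second = sampling_rate * bits_per_sample * channels
print(f"{bits_per_second / 1e6:.2f} Mbit/s")         # ~1.41 Mbit/s
print(f"{bits_per_second * 60 / 1e6:.1f} Mbit/min")  # ~84.7 Mbit/min
print(f"{bits_per_second * 3600 / 1e9:.1f} Gbit/h")  # ~5.1 Gbit/h
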
Audio Data Compression
 Redundant information
 Implicit in the remaining information
 Ex. oversampled audio signal

 Irrelevant information
 Perceptually insignificant
 Cannot be recovered from the remaining
information

Audio Data Compression
 Lossless Audio Compression
 Removes redundant data
 Resulting signal is the same as the original – perfect
reconstruction; e.g., Huffman, LZW
 Lossy Audio Encoding
 Removes irrelevant data
 Resulting signal is similar to the original;
e.g., ADPCM, LPC

Audio Data Compression
 Audio vs. Speech Compression
Techniques
 Speech Compression uses a human vocal
tract model to compress signals
 Audio Compression does not use this
technique due to the larger variety of possible
signal variations

Generic Audio Encoder
 Psychoacoustic Model
 Psychoacoustics – study of how sounds are
perceived by humans
 Uses perceptual coding
 eliminates information from the audio signal that is
inaudible to the ear
 Detects conditions under which different audio
signal components mask each other

Additional Encoding Techniques
 Other encoding techniques are
available (alternative or in combination)
 Predictive Coding
 Coupling / Delta Encoding
 Huffman Encoding

Additional Encoding Techniques
 Predictive Coding
 Often used in speech and image compression
 Estimates the expected value of each sample based
on previous sample values
 Transmits/stores the difference between the expected
and the actual value
 The decoder generates an estimate for the next sample and
then adjusts it by the difference stored for the current
sample
 Used for additional compression in MPEG-2 AAC
(a minimal sketch follows)
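
A minimal first-order sketch of this idea in Python; the predictor here is simply the previous sample, whereas real codecs use higher-order adaptive predictors:

def dpcm_encode(samples):
    """Predictive (DPCM) encoding: store the difference between
    each sample and its prediction (here, the previous sample)."""
    prediction = 0
    residuals = []
    for s in samples:
        residuals.append(s - prediction)  # transmit/store the difference
        prediction = s                    # next prediction = current sample
    return residuals

def dpcm_decode(residuals):
    """Reconstruct by adjusting each prediction by the stored difference."""
    prediction = 0
    samples = []
    for r in residuals:
        prediction += r
        samples.append(prediction)
    return samples

signal = [100, 102, 105, 104, 101]
print(dpcm_encode(signal))  # [100, 2, 3, -1, -3] -- small values code cheaply
assert dpcm_decode(dpcm_encode(signal)) == signal
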

Additional Encoding Techniques
 Coupling / Delta encoding
 Used where the audio signal consists of two or
more channels (stereo or surround sound)
 Similarities between channels are used for
compression
 The sum and difference of the two channels are
derived; the difference is usually close to
zero and therefore requires less space to encode
 This is a lossless encoding step (sketched below)
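
A minimal Python sketch of sum/difference coupling, assuming floating-point samples so the halving is exactly reversible:

import numpy as np

def ms_encode(left, right):
    """Sum/difference coupling: for similar channels the 'side'
    signal is near zero and needs fewer bits to encode."""
    mid = (left + right) / 2
    side = (left - right) / 2
    return mid, side

def ms_decode(mid, side):
    return mid + side, mid - side        # lossless reconstruction

left = np.array([0.50, 0.62, 0.70])
right = np.array([0.48, 0.60, 0.71])
mid, side = ms_encode(left, right)
print(side)                              # [ 0.01   0.01  -0.005] -- near zero
l, r = ms_decode(mid, side)
assert np.allclose(l, left) and np.allclose(r, right)
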

Additional Encoding Techniques
 Huffman Coding
 Information-theory-based technique
 An element that reoccurs often in the
signal is represented by a shorter symbol, and its
value is stored in a look-up table
 Implemented using look-up tables in the encoder and
the decoder
 Provides substantial lossless compression, but
requires high computational power and therefore is
not very popular
 Used by MPEG-1 and MPEG-2 AAC (sketched below)
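
A toy Huffman code builder in Python; real encoders ship pre-computed tables rather than building the tree per signal:

import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman table: frequent symbols get shorter codes."""
    freq = Counter(symbols)
    # Heap entries: (frequency, tiebreaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_codes("aaaabbc"))
# {'c': '00', 'b': '01', 'a': '1'} -- most frequent symbol, shortest code
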

Psychoacoustics
Limits of Human Hearing
– Time Domain Considerations
– Frequency Domain (Spectral) Considerations
– Amplitude vs. Power
– Masking in Time and Frequency Domains
– Sampling Rate and Signal Bandwidth

Limits of Human Hearing
 Time and Frequency
– Events longer than 0.03 seconds are resolvable in time;
shorter events are perceived as features in frequency
– 20 Hz < human hearing < 20 kHz (for those aged 18 to 25)
– “Pitch” is a PERCEPTION related to FREQUENCY
– Human pitch resolution is about 40 – 4000 Hz
Limits of Human Hearing
 Amplitude or Power?
– “Loudness” is a PERCEPTION related to POWER,
not AMPLITUDE
– Power is proportional to the (integrated) square of the signal
– The human loudness perception range is about 120 dB,
where dB = 10 log10(power ratio) = 20 log10(amplitude ratio),
so +10 dB means 10x the power (about 3.16x the amplitude);
see the sketch below
– Waveform shape is of little consequence. The energy at
each frequency, and how that changes in time, is the
most important feature of a sound.
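
The decibel relations above, as a small Python check:

import math

def db_from_power_ratio(p):
    return 10 * math.log10(p)       # dB = 10 log10(power ratio)

def db_from_amplitude_ratio(a):
    return 20 * math.log10(a)       # power ~ amplitude^2, hence the 20

print(db_from_power_ratio(10))             # 10.0 dB: 10x the power
print(db_from_amplitude_ratio(10))         # 20.0 dB: 10x the amplitude
print(db_from_amplitude_ratio(10 ** 0.5))  # 10.0 dB: 10x power ~ 3.16x amplitude
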
Limits of Human Hearing
 Waveshape or Frequency Content?
– Here are two waveforms with identical power spectra,
which are (nearly) perceptually identical:

[Figure: Wave 1 and Wave 2 (different waveshapes),
with the magnitude spectrum of either]
Limits of Human Hearing
 Masking in Amplitude, Time, and Frequency
– Masking in amplitude: loud sounds ‘mask’ soft ones.
Example: quantization noise
– Masking in time: a soft sound just before a louder
sound is more likely to be heard than if it is just after.
– Masking in frequency: a loud ‘neighbor’ frequency masks soft
spectral components. Low sounds mask higher ones more than
high sounds mask low ones.
Limits of Human Hearing
 Masking in Amplitude
 Intuitively, a soft sound will not be heard if there
is a competing loud sound. Reasons:
 Gain controls in the ear
(stapedius reflex and more)
 Interaction (inhibition) in the cochlea
 Other mechanisms at higher levels
Limits of Human Hearing
 Masking in Time
 In the time range of a few milliseconds:
 A soft event following a louder event tends to be
grouped perceptually as part of that louder event
 If the soft event precedes the louder event, it
might be heard as a separate event (become audible)
Limits of Human Hearing
 Masking in Frequency
 Only one component in this spectrum is audible
because of frequency masking
Spectral Analysis
 Tasks of Spectral Analysis
 To derive masking thresholds to determine
which signal components can be eliminated
 To generate a representation of the signal to
which masking thresholds can be applied
 Spectral Analysis is done through
transforms or filter banks

Spectral Analysis
 Transforms
 Fast Fourier Transform (FFT)
 Discrete Cosine Transform (DCT) - similar to
FFT but uses cosine values only
 Modified Discrete Cosine Transform (MDCT)
[used by MPEG-1 Layer-III, MPEG-2 AAC,
Dolby AC-3] – overlapped and windowed
version of DCT
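
A direct (O(N^2)) MDCT sketch in Python; production codecs use a windowed, 50%-overlapped, FFT-based implementation:

import numpy as np

def mdct(frame):
    """Direct MDCT: maps a 2N-sample frame to N coefficients.
    X[k] = sum_n x[n] cos(pi/N * (n + 0.5 + N/2) * (k + 0.5))"""
    two_n = len(frame)
    n = two_n // 2
    ns = np.arange(two_n)[None, :]
    ks = np.arange(n)[:, None]
    basis = np.cos(np.pi / n * (ns + 0.5 + n / 2) * (ks + 0.5))
    return basis @ frame

frame = np.sin(2 * np.pi * 3 * np.arange(36) / 36)  # 36-sample long block
print(mdct(frame).shape)  # (18,) -- half as many coefficients as samples
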

Spectral Analysis
 Filter Banks
 Time sample blocks are passed through a set
of bandpass filters
 Masking thresholds are applied to resulting
frequency subband signals
 Poly-phase and wavelet banks are the most
popular filter structures

Compression Models
 Perceptual Models
 Production Models
 Event-Based Models
Perceptual Models
 Exploit masking, etc., to discard perceptually
irrelevant information
 Example: quantize soft sounds more accurately,
loud sounds less accurately
 Benefits: generic; does not require assumptions
about what produced the sound
 Drawbacks: the highest compression is difficult to achieve
Loudness and Pitch
(Review of Psychoacoustic Effects)
 Hearing is more sensitive to loudness at intermediate
frequencies than at other frequencies
 intermediate frequencies: [500 Hz, 5000 Hz]
 human hearing range: [20 Hz, 20,000 Hz]
 Perceived loudness of a sound changes
with the frequency of that sound
 the basilar membrane reacts more to intermediate
frequencies than to other frequencies
Fletcher-Munson Contours
 Each contour represents an equal perceived loudness
 Perception sensitivity (loudness) is not linear across all
frequencies and intensities
Production Models
 Build a model of the sound production system, then
fit the parameters
 Example: if the signal is speech, then a well-
parameterized vocal model can yield the
highest quality and compression ratio
 Benefits: highest possible compression
 Drawbacks: signal source(s) must be assumed, known, or
identified
MIDI and Other ‘Event’ Models
 Musical Instrument Digital Interface:
represents music as notes and events and uses a
synthesis engine to “render” it
 An Edit Decision List (EDL) is another example:
a history of source materials, transformations, and
processing steps is kept. Operations can be undone or
recreated easily.
Future: Multi-Model Parametric Compressors?
 Analysis front end identifies source(s)
 Audio is (separated and) sent to optimal
model(s)
 Benefits: high compression
 Drawbacks: complexity
MPEG-1 Audio Encoding
 Characteristics
 Precision: 16 bits
 Sampling frequencies: 32 kHz, 44.1 kHz, 48 kHz
 3 compression layers: Layer 1, Layer 2, Layer 3 (MP3)
 Layer I: uses sub-band coding; 32-448 kbps, target 192 kbps
 Layer II: uses sub-band coding (longer frames, more
compression); 32-384 kbps, target 128 kbps
 Layer III: uses both sub-band coding and transform coding;
32-320 kbps, target 64 kbps
MPEG Audio Encoding Steps
MPEG Audio Filter Bank
 The filter bank divides the input into 32 equal-width
frequency sub-bands
 The output sample of sub-band i at time t is defined as

St[i] = Σ_{k=0}^{63} Σ_{j=0}^{7} cos((2i+1)(k−16)π / 64) · C[k+64j] · x[k+64j]

 i ∈ [0, 31]; St[i] – filter output sample for sub-band i at time t
 C[n] – one of 512 coefficients
 x[n] – audio input sample from a 512-sample buffer
(a numpy transcription follows)
MPEG/audio divides the audio signal into frequency sub-bands that approximate critical
bands, then quantizes each sub-band according to the audibility of quantization
noise within that band.
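
A direct numpy transcription of this analysis step, assuming C holds the 512 window coefficients from the standard's tables and x_buf the 512 most recent input samples:

import numpy as np

def subband_samples(x_buf, C):
    """One filter-bank step: 512 input samples -> 32 sub-band samples."""
    z = C * x_buf                                    # window the buffer
    y = z.reshape(8, 64).sum(axis=0)                 # Y[k] = sum_j z[k + 64j]
    i = np.arange(32)[:, None]
    k = np.arange(64)[None, :]
    M = np.cos((2 * i + 1) * (k - 16) * np.pi / 64)  # 32x64 cosine matrix
    return M @ y                                     # St[i], i = 0..31
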
Dominant band and the mask
 The dominant band is found and the corresponding
mask is applied
Quantization of Audible Sound
 The components that exceed the mask are
quantized and encoded using Huffman
coding
 Masking and quantization example: performing the
sub-band filtering step on the input results in the
following values (for demonstration, only the first 16
of the 32 bands are shown):

Band:       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Level (dB): 0  8 12 10  6  2 10 60 35 20 15  2  3  5  3  1
 The 60 dB level of the 8th band gives a masking of 12 dB in the 7th band
and 15 dB in the 9th (according to the psychoacoustic model). The level in the
7th band is 10 dB (< 12 dB), so it is ignored. The level in the 9th band is 35 dB
(> 15 dB), so it is sent; only the amount above the masking level is sent
(see the sketch below).

 Determine the number of bits needed to represent the coefficient such that the
noise introduced by quantization stays below the masking effect, i.e., [noise
introduced = 12 dB; masking = 15 dB]
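
The same decision, sketched in Python; the two thresholds are the ones from the example above, whereas a real model computes a full spreading function across all bands:

levels = [0, 8, 12, 10, 6, 2, 10, 60, 35, 20, 15, 2, 3, 5, 3, 1]  # dB, bands 1-16
mask = {7: 12, 9: 15}    # masking thresholds (dB) set by the dominant 8th band

for band, level in enumerate(levels, start=1):
    if band in mask:
        if level < mask[band]:
            print(f"band {band}: {level} dB < {mask[band]} dB mask -> ignore")
        else:
            print(f"band {band}: send {level - mask[band]} dB above the mask")
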

Rate control loop
 For a given bit rate allocation, adjust the
quantization steps to achieve the bit rate
 This loop checks whether the number of bits
resulting from the coding operation exceeds
the number of bits available to code a
given block of data
 If so, the quantization step is
increased to reduce the total bits (sketched below)
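
A minimal sketch of this inner loop in Python; count_bits stands in for the real entropy coder's bit count (a hypothetical helper supplied by the caller):

def rate_control(coefficients, bit_budget, count_bits):
    """Coarsen the quantizer step until the coded block fits the budget."""
    step = 1.0
    while True:
        quantized = [round(c / step) for c in coefficients]
        if count_bits(quantized) <= bit_budget:
            return quantized, step
        step *= 1.1          # larger step -> coarser values -> fewer bits
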
MPEG Audio Bit Allocation
 This process determines the number of code bits allocated to
each sub-band, based on information from the psycho-
acoustic model
 Algorithm (sketched below):
1. Compute the mask-to-noise ratio: MNR = SNR - SMR
 The standard provides tables that give estimates of the SNR
resulting from quantizing to a given number of quantizer levels
2. Search for the sub-band with the lowest MNR
3. Allocate code bits to this sub-band
 If the sub-band gets more code bits than appropriate, look
up a new estimate of the SNR and repeat from step 1
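
A greedy sketch of this allocation in Python; snr_table stands in for the standard's SNR tables and holds made-up values:

def allocate_bits(smr, snr_table, total_bits, bits_per_step=16):
    """Repeatedly refine the sub-band with the lowest MNR = SNR - SMR."""
    steps = [0] * len(smr)                    # allocation steps per sub-band
    while total_bits >= bits_per_step:
        mnr = [snr_table[steps[i]] - smr[i] for i in range(len(smr))]
        worst = mnr.index(min(mnr))           # sub-band with the lowest MNR
        if steps[worst] + 1 >= len(snr_table):
            break                             # cannot refine it further
        steps[worst] += 1
        total_bits -= bits_per_step
    return steps

snr_table = [0, 7, 16, 26, 37]                # est. SNR (dB) per step (made up)
print(allocate_bits(smr=[20, 5, -3, 12], snr_table=snr_table, total_bits=96))
# [3, 1, 0, 2] -- bands with higher SMR (more audible noise) get more bits
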
Distortion control loop
 This loop shapes the quantization noise according to the
perceptual masking threshold
 Start with a default scale factor of 1.0 for every band
 If the quantization error in a band exceeds the masking
threshold, the scale factor is adjusted to reduce that
quantization error
 This demands more bits, so the rate control loop has
to be invoked every time the scale factors are
changed
 Distortion control is executed until the noise level
is below the perceptual mask for every band (sketched below)
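
A sketch of this outer loop around the rate_control sketch above; error_db is a hypothetical helper measuring per-band quantization noise (one coefficient per band here, for brevity), and real encoders also cap the iteration count:

def distortion_control(coeffs, mask, bit_budget, count_bits, error_db):
    """Amplify bands whose quantization noise exceeds the mask,
    re-running the rate loop whenever scale factors change."""
    scale = [1.0] * len(coeffs)
    while True:
        scaled = [c * s for c, s in zip(coeffs, scale)]
        quantized, step = rate_control(scaled, bit_budget, count_bits)
        noisy = [b for b in range(len(coeffs))
                 if error_db(quantized[b], scaled[b], step) > mask[b]]
        if not noisy:
            return quantized, scale, step   # noise below the mask everywhere
        for b in noisy:
            scale[b] *= 1.25                # finer effective quantization there
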
Decoder
 The decoder side is relatively simple: the gain,
scale factors, and quantization steps recovered
from the bitstream are used to reconstruct the
filter bank responses
 Filter bank responses are combined to
reconstruct the decoded audio signal
MPEG Coding Specifications

MPEG Layer I
 The filter is applied one frame (12 x 32 = 384 samples) at a time;
at 48 kHz, each frame carries 8 ms of sound
 Uses a 512-point FFT to get detailed spectral information
about the signal (for the sub-band filter)
 Uses equal frequency spread per band
 The psychoacoustic model only uses frequency masking
 Typical applications: digital recording on tapes, hard disks,
or magneto-optical disks, which can tolerate the high bit
rate. Highest quality is achieved at a bit rate of 384 kbps.
MPEG Layer II
 Uses three frames in the filter (previous, current, next; a total of
1152 samples); at 48 kHz, each frame carries 24 ms of sound
 Models a little of the temporal masking
 Uses a 1024-point FFT for greater frequency resolution
 Uses equal frequency spread per band
 Highest quality is achieved at a bit rate of 256 kbps
 Typical applications: audio broadcasting, television,
consumer and professional recording, and multimedia
MPEG Layer III
 A better critical-band filter is used
(non-equal frequency bands)
 The psychoacoustic model includes temporal masking effects,
takes stereo redundancy into account, and uses a Huffman
coder
 Stereo redundancy coding:
 Intensity stereo coding – in upper-frequency sub-bands,
encode the summed signal instead of independent signals
from the left and right channels
 Middle/Side (MS) stereo coding – encode the middle (sum of
left and right) and side (difference of left and right)
channels
Joint Stereo
 Joint stereo coding takes advantage of the fact
that both channels of a stereo channel pair
contain similar information
 These stereophonic irrelevancies and
redundancies are exploited to reduce the total
bitrate.
 Joint stereo is used in cases where only low
bitrates are available but stereo signals are
desired.
MP3 Audio Format

[Figure: MP3 file structure]
Source: http://wiki.hydrogenaudio.org/images/e/ee/Mp3filestructure.jpg
Successor of MP3
 Advanced Audio Coding (AAC) (MPEG-2
AAC) – now part of MPEG-4 Audio
 Can deliver 320 kbps for five channels (5.1-channel
system)
 Also capable of delivering high-quality stereo sound at
bitrates below 128 kbps
 Supports up to 48 full-bandwidth audio channels
 Supports 3 profiles: Main, Low Complexity,
Scalable Sampling Rate
 Default audio format for iPhone, iPad, PlayStation, Nokia,
Android, BlackBerry
 Introduced in 1997 as MPEG-2 Part 7
 In 1999, updated and included in MPEG-4
AAC's Improvements over MP3
 More sampling frequencies (8 – 96 kHz)
 Arbitrary bit rates and variable frame
length
 Higher efficiency and a simpler filter bank
 Uses a pure MDCT (modified discrete cosine
transform)
 Also used in Windows Media Audio
MPEG-4 Audio
 Variety of applications
 General audio signals
 Speech signals
 Synthetic audio
 Synthesized speech (structured audio)
MPEG-4 Audio Part 3
 Includes a variety of audio coding technologies
 Lossy speech coding (e.g., CELP –
code-excited linear prediction)
 General audio coding (AAC)
 Hardware data compression
 Text-to-Speech interface
 Structured Audio (e.g., MIDI)
MPEG-4 Part 14
 Called MP4, with extension .mp4
 Multimedia container format
 Stores digital video and audio streams and
allows streaming over the Internet
 Container or wrapper format
 a meta-file format whose spec describes how
different data elements and metadata coexist
in a computer file
Conclusion
 MPEG Audio is an integral part of the
MPEG standard to be considered together
with video
 MPEG-4 Audio represents a major
extension of MPEG-1 Audio in terms of
capabilities
