
Survey of speech coding in cellular systems: GSM
Gizachew Hailegebriel Mako
Department of Electrical and Computer Engineering, Addis Ababa University
wingiza@gmail.com
Addis Ababa, Ethiopia

Abstract
There has been substantial progress in speech
coding technology research and standardization in
the past few decades. The main focus has been to
represent speech signal using as few bits as
possible, while at the same time maintaining a
reasonable level of speech quality. In this paper
we present a survey of speech coding
technologies. The types, methods and standards
of common speech coding will be investigated
with particular emphasis on speech coding for
GSM network.
1. Introduction
Although with the emergence of optical fibers
bandwidth in wired communications has become
inexpensive, there is a growing need for
bandwidth conservation and enhanced privacy in
wireless cellular and satellite communications.
Digital speech brings flexibility and opportunities
for encryption; when uncompressed, however, it has
a high data rate and hence places heavy demands on
transmission bandwidth and storage. Speech coding,
or speech compression, is
the field concerned with obtaining compact digital
representations of voice signals for the purpose of
efficient transmission or storage. Speech coding
involves sampling and amplitude quantization.
While the sampling is almost invariably done at a
rate equal to or greater than twice the bandwidth
of analog speech, there has been a great deal of
variability among the proposed methods in the
representation of the sampled waveform. The
objective in speech coding is to represent speech
with a minimum number of bits while maintaining
its perceptual quality. The quantization or binary
representation can be direct or parametric. Direct
quantization implies binary representation of the
speech samples themselves, while parametric
quantization involves binary representation of
speech model and/or spectral parameters. Speech
is generally band limited to 4 kHz and sampled at
8 kHz.
Speech coding determines the rate at which speech
is converted into digital data while maintaining
acceptable speech quality, by mapping fewer bits to
each digitized voice sample. The oldest technique,
used in all telephone exchanges, is PCM, which
provides speech data at 64 kbps. Later techniques
were designed to decrease the speech rate because
of the limited bandwidth of the air interface in
wireless standards such as GSM, CDMA and LTE.
The reduction in speech codec data rate should not
impact the quality of the speech; this is the utmost
priority of every speech codec.
2. Objective
The purpose of speech compression is to
reduce the number of bits required to represent
speech signals (by reducing redundancy) in order
to minimize the requirement for transmission
bandwidth for voice transmission over mobile
channels with limited capacity. We first review the
properties of speech and how the performance of
coding algorithms is measured, then describe speech
compression coding techniques. Finally, the
application of these techniques in GSM is discussed.
3. Modeling speech production system
The human speech production system can be
modeled using a rather simple structure: the lungs,
which generate the air or energy to excite the vocal
tract, are represented by a white noise source. The
acoustic path inside the body, with all its
components, is associated with a time-varying filter.

Figure 1: Model of the human speech production system

The filter has the general form

$$H(z) = \frac{1}{1 - \sum_{i=1}^{m} a_i z^{-i}}$$

where $a_1, \dots, a_m$ are the filter coefficients.

The parameters of the time-varying filter are
estimated using the linear prediction method. In
parametric coding, only these predicted parameters
are transmitted to the decoder, instead of the
quantized speech as in the case of PCM. [7][11]
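As a concrete illustration, the following minimal Python sketch estimates the filter coefficients for one frame using the autocorrelation method and the Levinson-Durbin recursion. The function name and prediction order are illustrative; a real codec would add windowing choices, bandwidth expansion and coefficient quantization.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Estimate LPC coefficients a_1..a_m for one quasi-stationary
    frame via the autocorrelation method and the Levinson-Durbin
    recursion, in the convention H(z) = 1 / (1 - sum_i a_i z^-i)."""
    windowed = frame * np.hamming(len(frame))
    # Autocorrelation for lags 0..order
    r = np.array([windowed[:len(windowed) - k] @ windowed[k:]
                  for k in range(order + 1)])
    # Levinson-Durbin: build up the prediction-error filter A(z)
    a = np.zeros(order + 1)
    a[0] = 1.0
    error = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / error  # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        error *= 1.0 - k * k                          # residual energy
    # Negate: internally A(z) = 1 + sum, while H(z) above uses 1 - sum
    return -a[1:], error

# Example: a 20 ms frame (160 samples at 8 kHz) of placeholder noise
coeffs, gain = lpc_coefficients(np.random.randn(160))
```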
4. Speech Codecs
4.1 Speech Properties
Speech signals are non-stationary and at
best they can be considered as quasi-stationary
over short segments, typically 5-20 ms. The
statistical and spectral properties of speech are
thus defined over short segments. Speech can
generally be classified as voiced, created by air
passed through the vocal cords (e.g. a, I, ah, v),
unvoiced, created by air through mouth and lips
(e.g. sh, s, f), or mixed/transitional and silence.
Time and frequency domain plots for sample
voiced and unvoiced segments are shown in Fig. 2.
Voiced speech is quasi-periodic in the time domain
and harmonically structured in the
frequency domain, while unvoiced speech is
random-like and broadband. In addition, the
energy of voiced segments is generally higher than
the energy of unvoiced segments. The short-time
spectrum of voiced speech is characterized by its
fine and formant structure. The fine harmonic
structure is a consequence of the quasi-periodicity
of speech and may be attributed to the vibrating
vocal cords. The formant structure (spectral
envelope) is due to the interaction of the source
and the vocal tract. [5][11]
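The contrast in energy and zero-crossing behaviour suggests a simple frame classifier. Below is a minimal sketch (Python with numpy); the thresholds are illustrative, not taken from any standard.

```python
import numpy as np

def classify_frame(frame, energy_floor=1e-4, zcr_split=0.25):
    """Crude voiced/unvoiced/silence decision for one 5-20 ms frame,
    using the two properties noted above: voiced speech has high
    energy and few zero crossings; unvoiced speech is noise-like
    with many zero crossings."""
    energy = np.mean(frame ** 2)
    # Fraction of adjacent sample pairs whose sign changes
    zero_cross_rate = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    if energy < energy_floor:
        return "silence"
    return "unvoiced" if zero_cross_rate > zcr_split else "voiced"
```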

Figure 2: Voiced and unvoiced segments and their short-time spectra

4.2 Coding Performance


A speech coding algorithm is evaluated
based on the bit rate, the quality of reconstructed
("coded") speech, the complexity of the algorithm,
the delay introduced, and the robustness of the
algorithm to channel errors and acoustic
interference. In general high-quality speech coding
at low-rates is achieved using high complexity
algorithms. For example, real-time implementation
of a low-rate hybrid algorithm must typically be
done on a digital signal processor capable of
executing 12 or more million instructions per
second (MIPS). The one-way delay (coding plus
decoding delay only) introduced by such
algorithms is usually between 50 and 60 ms. Robust
speech coding systems incorporate error correction
algorithms to protect the perceptually important
information against channel errors.
In digital communications speech quality
is classified into four general categories, namely:
broadcast, network or toll, communications, and
synthetic. Broadcast wideband speech refers to
high quality "commentary" speech that can
generally be achieved at rates above 64 kbits/s.
Toll or network quality refers to quality
comparable to that of classical analog speech
(200-3200 Hz) and can be achieved at rates above
16 kbits/s. Communications quality implies
somewhat degraded speech quality which is
nevertheless natural, highly intelligible, and
adequate for telecommunications. Synthetic
speech is usually intelligible but can be unnatural
and associated with a loss of speaker
recognizability. Communications speech can be
achieved at rates above 4.8 kbits/s and the current
goal in speech coding is to achieve
communications quality at 4.0 kbits/s. Currently,
speech coders operating well below 4.0 kbits/s
tend to produce speech of synthetic quality.

Gauging speech quality is an important
but also very difficult task. The signal-to-noise
ratio (SNR) is one of the most common objective
measures for evaluating the performance of a
compression algorithm. It is given by

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{n} x^2(n)}{\sum_{n} \left( x(n) - \hat{x}(n) \right)^2}$$

where $x(n)$ is the original speech data and
$\hat{x}(n)$ is the coded speech data.

Temporal variations of the performance can be
better detected and evaluated using a short-time
signal-to-noise ratio, i.e., by computing the SNR
for each N-point segment of speech. A performance
measure that exposes weak-signal performance is
the segmental SNR (SEGSNR), obtained by averaging
these short-time values (in dB) over the M
segments of the signal:

$$\mathrm{SEGSNR} = \frac{1}{M} \sum_{m=0}^{M-1} \mathrm{SNR}_m$$
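To make the two measures concrete, here is a short sketch of both computations (Python with numpy; the segment length N = 160 corresponds to 20 ms at 8 kHz):

```python
import numpy as np

def snr_db(x, x_hat):
    """Conventional (global) SNR between original and coded speech."""
    return 10 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

def segsnr_db(x, x_hat, n=160):
    """Segmental SNR: average the per-segment SNR (in dB) over
    N-point segments, which exposes poor coding of quiet passages
    that a single global SNR would hide."""
    seg_snrs = [snr_db(x[i:i + n], x_hat[i:i + n])
                for i in range(0, len(x) - n + 1, n)]
    return float(np.mean(seg_snrs))
```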

Objective measures are often sensitive to
both gain variations and delays. Subjective test
procedures like the Mean Opinion Score (MOS)
are based on listener ratings and are widely used to
quantify coded speech quality. The MOS test usually
involves 12 to 24 listeners who are instructed to
rate phonetically balanced records according to a
5-level quality scale (Table 1). Excellent speech
quality implies that coded speech is
indistinguishable from the original and without
perceptible noise. On the other hand, bad
(unacceptable) quality implies the presence of
extremely annoying noise and artifacts in the
coded speech. Ratings are obtained by averaging
numerical scores over several hundreds of speech
records. MOS scores for some common speech coding
algorithms are: PCM 4.3, ADPCM 4.1, CELP 3.7 and
RPE-LPC 3.54. The MOS range relates to speech
quality as follows: a MOS of 4-4.5 implies network
quality, scores between 3.5 and 4 imply
communications quality, and a MOS between 2.5 and
3.5 implies synthetic quality. We note here that
MOS ratings may differ significantly from test to
test and hence they are not absolute measures for
the comparison of different coders. [3][5]

Table 1: The 5-level MOS quality scale
Rating   Quality
5        Excellent
4        Good
3        Fair
2        Poor
1        Bad

4.3 Commonly used speech codecs

Here we briefly discuss the main speech
coding techniques which are used today, and those
which may be used in the future. To simplify the
description of speech codecs, they are often
broadly divided into three classes:
i. Waveform coders
   Time domain: PCM, ADPCM
   Frequency domain: sub-band coders, adaptive transform coders
ii. Vocoders
   Linear predictive coders
   Formant coders
iii. Hybrid codecs
Typically waveform codecs are used at
high bit rates, and give very good quality speech.
Vocoders/source codecs operate at very low bit
rates, but tend to produce speech which sounds
synthetic. Hybrid codecs use techniques from both
source and waveform coding, and give good
quality speech at intermediate bit rates. This is
illustrated in Figure 3, which shows how the speech
quality of the three main classes of speech
codecs varies with the bit rate of the codec.

Figure 3: Speech Quality versus Bit Rate for Common Classes of Codecs


4.3.1 Waveform Coding

Waveform-based codecs attempt to remove the
correlation between speech samples to achieve
compression, and aim to minimize the error between
the reconstructed and the original speech
waveforms. Generally they are low complexity
codecs which produce high quality speech at rates
above about 16 kbits/s. When the data rate is
lowered below this level, the reconstructed speech
quality degrades rapidly. [8]
The simplest form of waveform coding is
PCM, which merely involves sampling and
quantizing the input waveform. Narrow-band
speech is typically band-limited to 4 kHz and
sampled at 8 kHz. If linear quantization is used,
around twelve bits per sample are needed for good
quality speech, giving a bit rate of 96 kbits/s.
This bit rate can be reduced by using non-uniform
quantization of the samples. In speech coding an
approximation to a logarithmic quantizer is often
used. Such quantizers give a signal-to-noise ratio
which is almost constant over a wide range of
input levels, and at a rate of eight bits per
sample (64 kbits/s) give a reconstructed signal
which is almost indistinguishable from the
original.
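As an illustration of such logarithmic quantization, the sketch below implements μ-law companding of the kind used in 64 kbits/s PCM, followed by a uniform 8-bit quantizer. It is a sketch only, assuming input normalized to [-1, 1]; it omits the segmented approximation used in the actual G.711 standard.

```python
import numpy as np

MU = 255.0   # companding constant of North American / Japanese PCM

def mu_law_encode(x, bits=8):
    """Logarithmic companding followed by a uniform quantizer: small
    amplitudes get finer effective steps, so the SNR stays roughly
    constant over a wide range of input levels."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    q = 2 ** (bits - 1) - 1
    return np.round(y * q).astype(int)       # integer codes, 8 bits/sample

def mu_law_decode(code, bits=8):
    q = 2 ** (bits - 1) - 1
    y = code / q
    # Invert the compression curve: |x| = ((1 + MU)^|y| - 1) / MU
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

tone = 0.5 * np.sin(2 * np.pi * 200 * np.arange(160) / 8000)  # 200 Hz at 8 kHz
restored = mu_law_decode(mu_law_encode(tone))
```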
A commonly used technique in speech
coding is to attempt to predict the value of the
next sample from the previous samples. This is
possible because of the correlations present in
speech samples due to the effects of the vocal
tract and the vibrations of the vocal cords. If the
prediction is effective, the error signal between
the predicted and actual speech samples will have
a lower variance than the original samples, and
can therefore be quantized with fewer bits than
the original speech signal. This is the basis of
Differential Pulse Code Modulation (DPCM)
schemes: they quantize the difference between the
original and predicted signals. The results from
such codecs can be improved if the predictor and
quantizer are made adaptive, so that they change to
match the characteristics of the speech being
coded. This leads to Adaptive Differential PCM
(ADPCM) codecs.
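A minimal sketch of this idea follows (Python with numpy). The first-order predictor, 4-bit quantizer and step adaptation rule are all illustrative choices, far simpler than a standardized ADPCM codec such as G.726.

```python
import numpy as np

def adpcm_encode(x, a=0.9, step=0.02):
    """First-order DPCM with a crude adaptive quantizer: predict each
    sample as a * (previous reconstructed sample), quantize only the
    prediction error to 4 bits, and adapt the step size so the
    quantizer tracks loud and quiet passages."""
    codes, pred = [], 0.0
    for sample in x:
        q = int(np.clip(round((sample - pred) / step), -8, 7))
        codes.append(q)
        pred = a * (pred + q * step)          # decoder-side reconstruction
        step = min(max(step * (1.2 if abs(q) > 4 else 0.9), 1e-4), 0.5)
    return codes

def adpcm_decode(codes, a=0.9, step=0.02):
    out, pred = [], 0.0
    for q in codes:                           # mirrors the encoder exactly
        recon = pred + q * step
        out.append(recon)
        pred = a * recon
        step = min(max(step * (1.2 if abs(q) > 4 else 0.9), 1e-4), 0.5)
    return np.array(out)
```

Because the step adaptation depends only on the transmitted codes, the decoder stays in lockstep with the encoder without any side information.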
The waveform codecs described above all code
speech with an entirely time domain approach.
Frequency domain approaches are also possible, and
have certain advantages. For example, in Sub-Band
Coding (SBC) the input speech is split into a
number of frequency bands, or sub-bands, and each
is coded independently using, for example, an
ADPCM-like coder. At the receiver the sub-band
signals are decoded and recombined to give the
reconstructed speech signal. The advantage of
doing this comes from the fact that the noise in
each sub-band depends only on the coding used in
that sub-band. Therefore we can allocate more bits
to perceptually important sub-bands so that the
noise in these frequency regions is low, while in
other sub-bands we may allow higher coding noise
because noise at these frequencies is less
perceptually important. Adaptive bit allocation
schemes may be used to further exploit these
ideas, as the sketch below illustrates. Sub-band
codecs tend to produce communications to toll
quality speech in the range 16-32 kbits/s. Due to
the filtering necessary to split the speech into
sub-bands they are more complex than simple DPCM
coders, and introduce more coding delay. However,
the complexity and delay are still relatively low
compared to most hybrid codecs.
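A toy two-band version of the idea is sketched below. It uses a Haar sum/difference pair in place of the longer quadrature-mirror filters a real sub-band codec would use, and the per-band bit allocation is illustrative.

```python
import numpy as np

def quantize(band, bits):
    """Uniform quantizer; more bits mean lower noise in that band."""
    scale = (2 ** (bits - 1) - 1) / (np.max(np.abs(band)) + 1e-12)
    return np.round(band * scale) / scale

def two_band_codec(x, low_bits=6, high_bits=3):
    """Split speech (an even-length float array) into low/high
    half-bands with a Haar sum/difference pair, decimate by 2,
    quantize each band with its own bit budget, and resynthesize.
    The noise in each band depends only on that band's bits."""
    low = (x[0::2] + x[1::2]) / 2.0    # crude low-pass, decimated
    high = (x[0::2] - x[1::2]) / 2.0   # crude high-pass, decimated
    low, high = quantize(low, low_bits), quantize(high, high_bits)
    y = np.empty_like(x)
    # Synthesis: perfect reconstruction when no quantization is applied
    y[0::2], y[1::2] = low + high, low - high
    return y
```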
Another frequency domain waveform
coding technique is Adaptive Transform Coding
(ATC), which uses a fast transform (such as
the discrete cosine transform) to split blocks
of the speech signal into a large number of
frequency bands. The number of bits used to code
each transform coefficient is adapted
depending on the spectral properties of the speech,
and toll quality reproduced speech can be achieved
at bit rates as low as 16 kbits/s.
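A sketch of one ATC block follows (Python, assuming scipy is available for the DCT; the bit-allocation rule and quantizer are illustrative):

```python
import numpy as np
from scipy.fft import dct, idct

def atc_block(x, avg_bits=4.0, max_bits=8):
    """Adaptive transform coding of one speech block: transform with
    a DCT, allocate quantizer bits to each coefficient from its (log)
    energy, quantize, and inverse-transform at the receiver."""
    c = dct(x, norm='ortho')
    log_e = np.log2(c ** 2 + 1e-12)
    # Classic rule: b_k = average bits + 0.5 * (log2 e_k - mean log2 e)
    bits = np.clip(np.round(avg_bits + 0.5 * (log_e - log_e.mean())),
                   0, max_bits).astype(int)
    step = 2.0 ** -bits                   # finer steps where more bits go
    c_hat = np.where(bits > 0, np.round(c / step) * step, 0.0)
    return idct(c_hat, norm='ortho')
```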
4.3.2 Vocoders/Source/Parametric Coding
Waveform-based coding aims to reduce
redundancy among speech samples and to
reconstruct speech as close as possible to the
original speech waveform. Because it compresses
individual speech samples, waveform-based coding
cannot achieve a high compression ratio and
normally operates at bit rates ranging from 64 kb/s
down to 16 kb/s. In contrast, parametric
compression methods are based on how speech is
produced. Instead of transmitting speech waveform
samples, parametric compression only sends relevant
parameters of the speech production model to the
receiver side and reconstructs the speech from
that model. Thus, a high compression ratio can be
achieved.
Source coders operate using a model of
how the source was generated, and attempt to
extract, from the signal being coded, the
parameters of that model; it is these model
parameters which are transmitted to the decoder.
Source coders for speech are called vocoders, and
work as follows. The vocal tract is represented as
a time-varying filter and is excited with either a
white noise source, for unvoiced speech segments,
or a train of pulses separated by the pitch period,
for voiced speech. Therefore the information
which must be sent to the decoder is the filter
specification, a voiced/unvoiced flag, the
variance (gain) of the excitation signal, and the
pitch period for voiced speech. This is updated
every 10-20 ms to follow the non-stationary nature
of speech.
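The decoder of such a two-state vocoder can be sketched directly from this parameter set (Python with numpy; the 160-sample frame length assumes 8 kHz sampling, and the coefficient convention matches H(z) from Section 3):

```python
import numpy as np

def synthesize_frame(coeffs, gain, voiced, pitch_period, n=160):
    """Rebuild one 20 ms frame (160 samples at 8 kHz) from the
    transmitted vocoder parameters: LPC filter coefficients, a
    voiced/unvoiced flag, excitation gain, and the pitch period."""
    if voiced:
        excitation = np.zeros(n)
        excitation[::pitch_period] = 1.0   # pulse train at the pitch period
    else:
        excitation = np.random.randn(n)    # white noise source
    excitation *= gain
    # All-pole synthesis filter: y[n] = e[n] + sum_i a_i * y[n-i]
    y = np.zeros(n)
    mem = np.zeros(len(coeffs))            # y[n-1] ... y[n-m]
    for i in range(n):
        y[i] = excitation[i] + coeffs @ mem
        mem = np.concatenate(([y[i]], mem[:-1]))
    return y

# Example: a voiced frame, 10th-order filter, 80-sample pitch (100 Hz)
frame = synthesize_frame(np.zeros(10), gain=1.0, voiced=True, pitch_period=80)
```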
The model parameters can be determined
by the encoder in a number of different ways,
using either time or frequency domain techniques.
Also the information can be coded for
transmission in various different ways. Vocoders
tend to operate at around 2.4 kbits/s or below, and
produce speech which, although intelligible, is far
from natural sounding. Increasing the bit rate
much beyond 2.4 kbits/s is not worthwhile because
of the inbuilt limitation in the coder's performance
due to the simplified model of speech production
used. The main use of vocoders has been in military
applications, where natural sounding speech is not
as important as a very low bit rate that allows
heavy protection and encryption.
There are several proposed models in the
literature. The most successful, however, is based
on linear prediction. In this approach, the human
speech production mechanism is summarized
using a time-varying filter with the coefficients of
the filter found using the linear prediction analysis
procedure. This class of coders works well at low
bit rates. Increasing the bit rate normally does not
translate into better quality, since quality is
restricted by the chosen model. Typical bit rates
are in the range of 2 to 5 kbps. Example coders of
this class include linear prediction coding (LPC)
and mixed excitation linear prediction (MELP). [7][8]

4.3.3 Hybrid Codecs

Hybrid codecs attempt to fill the gap
between waveform and source codecs. As described
above, waveform coders are capable of providing
good quality speech at bit rates down to about
16 kbits/s, but are of limited use at rates below
this. Vocoders, on the other hand, can provide
intelligible speech at 2.4 kbits/s and below, but
cannot provide natural sounding speech at any bit
rate. Although other forms of hybrid codecs exist,
the most successful and commonly used are time
domain Analysis-by-Synthesis (AbS) codecs.
Such coders use the same linear prediction filter
model of the vocal tract as found in LPC vocoders.
However instead of applying a simple two-state,
voiced/unvoiced, model to find the necessary input
to this filter, the excitation signal is chosen by
attempting to match the reconstructed speech
waveform as closely as possible to the original
speech waveform. AbS codecs were first
introduced in 1982 by Atal and Remde with what
was to become known as the Multi-Pulse Excited
(MPE) codec. Later the Regular-Pulse Excited
(RPE) and the Code-Excited Linear Predictive
(CELP) codecs were introduced. In RPE codecs
the pulses are regularly spaced at some fixed
interval. This means that the encoder needs only to
determine the position of the first pulse and the
amplitude of all the pulses, whereas with MPE
codecs the positions of all of these non-zero pulses
within the frame, and their amplitudes, must be
determined by the encoder and transmitted to the
decoder. Therefore, with RPE codecs less
information needs to be transmitted about pulse
positions, which is critically important in mobile
systems like GSM where bandwidth is especially
scarce. Both codecs can provide good quality
speech at rates of around 10 kbits/s, but they are
not suitable for rates much below this, due to the
large amount of information that must be
transmitted about the excitation pulses' positions
and amplitudes. This class dominates the medium
bit-rate coders, with the code-excited linear
prediction (CELP) algorithm and its variants as the
most outstanding representatives. A hybrid coder
tends to behave like a waveform coder at high bit
rates and like a parametric coder at low bit rates,
with fair to good quality at medium bit rates. [10][11]
4.4 Applications of Speech Coding in GSM

GSM carries speech at 13 kbps using an
analysis-by-synthesis LPC technique. The speech
codecs available in GSM include FR (Full Rate),
based on RPE-LPC, HR (Half Rate), EFR (Enhanced
Full Rate) and AMR (Adaptive Multi Rate). FR
provides 13 kbps, HR provides 5.6 kbps, EFR
provides 12.2 kbps and AMR provides from 4.75 to
about 12.2 kbps.

Descriptions of some of the GSM speech codec
techniques are given below.
CELP speech codec:
CELP (Code Excited Linear Prediction)
algorithms are designed to achieve speech
compression at around 4.8-8 kbps while maintaining
acceptable speech quality. Here the coder and
decoder have a predetermined codebook of
stochastic (zero-mean white Gaussian) excitation
signals. For each speech segment the transmitter
searches through its codebook of stochastic
signals for the one that gives the best perceptual
match to the sound when used as an excitation to
the LPC filter, and the index of the codebook
entry where the best match was found is
transmitted. The receiver uses this index to pick
the correct excitation signal for its synthesis
filter. The following standards are the most
popular for CELP-based codec design. G.728 is the
standard which performs speech coding at 16 kbps,
using the LD-CELP (Low Delay Code Excited Linear
Prediction) codec. G.729 is the standard which
performs audio compression at an 8 kbps speech
rate, using the CS-ACELP (Conjugate Structure
Algebraic Code Excited Linear Prediction)
technique. CS-ACELP operates at 8 kb/s with a
10 ms speech frame length plus 5 ms look-ahead
(a total algorithmic delay of 15 ms). [6][10][11]
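A stripped-down sketch of this codebook search is shown below (Python with numpy). The codebook size, random seed and plain squared-error match are illustrative; a real CELP coder such as CS-ACELP adds perceptual weighting, an adaptive codebook and gain quantization.

```python
import numpy as np

# Predetermined stochastic codebook shared by coder and decoder:
# zero-mean white Gaussian excitation vectors (here 512 of length 40)
rng = np.random.default_rng(0)
CODEBOOK = rng.standard_normal((512, 40))

def synth(excitation, coeffs):
    """Run an excitation vector through the all-pole LPC filter."""
    out = np.zeros(len(excitation))
    mem = np.zeros(len(coeffs))
    for i, e in enumerate(excitation):
        out[i] = e + coeffs @ mem
        mem = np.concatenate(([out[i]], mem[:-1]))
    return out

def celp_search(target, coeffs):
    """Analysis-by-synthesis: try every codebook entry (with its best
    gain) through the LPC filter and keep the entry whose synthetic
    speech best matches the target segment. Only the index and gain
    need to be transmitted."""
    best = (None, None, np.inf)
    for idx, code in enumerate(CODEBOOK):
        y = synth(code, coeffs)
        gain = (target @ y) / (y @ y + 1e-12)   # optimal gain for this entry
        err = np.sum((target - gain * y) ** 2)
        if err < best[2]:
            best = (idx, gain, err)
    return best[0], best[1]   # transmitted: codebook index and gain
```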
Full Rate / RPE-LPC codec:
RPE-LPC (Regular Pulse Excited - Linear
Predictive Coder) was the first speech codec used
with GSM; it was chosen after tests were undertaken
to compare it with other codec schemes of the day.
The speech codec is based upon regular pulse
excitation LPC with long term prediction. The
basic scheme is related to two previous speech
codecs, namely RELP (Residual Excited Linear
Prediction) and MPE-LPC (Multi Pulse Excited LPC).
The advantage of RELP is the relatively low
complexity resulting from the use of baseband
coding, but its performance is limited by the tonal
noise produced by the system. MPE-LPC is more
complex but provides a better level of performance.
The RPE-LPC codec provided a compromise between
the two, balancing performance and complexity for
the technology of the time.
Despite the work that was undertaken to
provide the optimum performance, as technology
developed further, the RPE-LPC codec was
viewed as offering a poor level of voice quality.
As other full rate audio codecs became available,
these were incorporated into the system.
[6][9][10][11]
EFR - Enhanced Full Rate codec:
Later another vocoder, the Enhanced
Full Rate (EFR) vocoder, was added in response to
the poor quality perceived by users of the
original RPE-LPC codec. Using ACELP compression
technology, this new codec gave a significant
improvement in sound quality over the original
RPE-LPC encoder and was adopted by GSM. It became
possible as the processing power available in
mobile phones increased, combined with their lower
current consumption. [9][10]
Half Rate codec:
The GSM standard allows the splitting of
a single full rate voice channel into two
sub-channels that can carry separate calls. By
doing this, network operators can double the
number of voice calls that can be handled by the
network with very little additional investment.
To enable this facility a half rate codec must be
used. The half rate codec was introduced in the
early years of GSM but gave a much inferior voice
quality when compared to other speech codecs.
However, it gave advantages when demand was high
and network capacity was at a premium.

The GSM Half Rate codec uses the VSELP
(Vector-Sum Excited Linear Prediction) algorithm,
in which each codebook vector is formed as a
linear combination of fixed basis vectors. The
speech frame length is 20 ms, divided into four
5 ms sub-frames. The codec produces 112 bits per
20 ms frame, giving a data rate of 5.6 kbps. This
includes a 100 bps mode indicator which details
whether the system believes the frames contain
voice data or not, allowing the speech codec to
operate in a manner that provides the optimum
quality. The Half Rate codec was introduced in the
1990s but, in view of the perceived poor quality,
it was not widely used. [9][10][11]
AMR Codec:
The AMR (Adaptive Multi-Rate) codec is
now the most widely used GSM codec. It was adopted
by 3GPP in October 1998 and is used for both GSM
and circuit switched UMTS/WCDMA voice calls. The
AMR codec requires optimized link adaptation so
that the optimum data rate is selected to meet the
requirements of the current radio channel
conditions, including its signal to noise ratio
and capacity. This is achieved by reducing the
source coding and increasing the channel coding.
Although there is a reduction in voice clarity,
the network connection is more robust and the link
is maintained without dropout. Improvement levels
of between 4 and 6 dB may be experienced.
However network operators are able to prioritize
each station for either quality or capacity.
The AMR concept is based on signal quality.
Different rates are available depending on the
amount of redundancy that must be introduced for
the measured carrier to interference ratio (C/I),
providing less or more error correction
capability. In AMR, the speech codec mode can be
changed either by the network or by the GSM
mobile, and may change every two speech frames. If
the change is mobile initiated, the mobile sends a
CMR (Codec Mode Request) to the network (i.e. the
BTS) and the BTS responds with a CMI (Codec Mode
Indication). If it is network initiated, the BTS
sends a CMC (Codec Mode Command) to the mobile.
There are 14 channel modes in the AMR speech
codec: 8 on the full rate channel and 6 on the
half rate channel, Table 2.

Table 2: AMR speech codec modes
Codec mode   Bit rate     Channel
AMR_12.20    12.2 kbps    FR only
AMR_10.20    10.2 kbps    FR only
AMR_7.95     7.95 kbps    FR and HR
AMR_7.40     7.40 kbps    FR and HR
AMR_6.70     6.70 kbps    FR and HR
AMR_5.90     5.90 kbps    FR and HR
AMR_5.15     5.15 kbps    FR and HR
AMR_4.75     4.75 kbps    FR and HR
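The link adaptation loop can be sketched as a simple threshold test on the measured C/I (Python; the threshold values are hypothetical, not the operator-tuned figures used in real networks):

```python
# Illustrative AMR full-rate mode selection from the measured C/I (dB).
# Real networks use operator-tuned hysteresis thresholds; these example
# values are not taken from the 3GPP specifications.
AMR_FR_MODES_KBPS = [4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, 12.2]
CI_THRESHOLDS_DB = [3, 6, 9, 11, 13, 15, 18]   # hypothetical switch points

def select_mode(ci_db):
    """Poor channel -> low source rate (more bits left for channel
    coding); good channel -> high source rate (better speech quality).
    The chosen mode is signalled via CMR/CMC every two speech frames."""
    idx = sum(ci_db >= t for t in CI_THRESHOLDS_DB)
    return AMR_FR_MODES_KBPS[idx]

print(select_mode(4.0))    # -> 5.15, a robust mode on a noisy channel
print(select_mode(20.0))   # -> 12.2, highest quality on a clean channel
```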

AMR consists of a family of codecs with different
channel coding, operating on the GSM Full Rate
(FR) and Half Rate (HR) channels. The aim is to
improve channel (FR/HR) quality by adapting the
most appropriate channel codec based on the current
appropriate channel codec based on the current
radio conditions. With AMR, the speech capacity
is increased by using the half rate (HR) mode and
still maintaining the quality level of current FR
calls. The idea behind the AMR codec concept is
that it is capable of adapting its operation
optimally according to the prevailing channel
conditions. The speech coder is capable
(theoretically) of switching its bit-rate every 20 ms
speech frame upon command. [2][10]
5. Summary
In this paper, we have discussed the properties
of speech, how speech is compressed, the
performance of coding algorithms, and speech
compression or coding techniques, focusing mainly
on GSM speech coding. We presented the three key
speech coding approaches: waveform, parametric and
hybrid coding. For waveform coding, we mainly
explained PCM and ADPCM, which are widely used
for both narrowband and wideband speech coding.
For parametric coding, we explained the concept
of parametric coding techniques, such as LPC. For
hybrid coding, we started from the problems with
waveform and parametric coding techniques and the
need to develop speech codecs with both high
speech quality and high compression ratio, and
then reviewed the revolutionary Analysis-by-Synthesis
(AbS) and CELP (Code Excited Linear Prediction)
approach.

We then presented some of the speech
coding techniques used in GSM. GSM carries speech
at 13 kbps using an analysis-by-synthesis LPC
technique. The speech codecs available in GSM
include FR (Full Rate), based on RPE-LPC, HR (Half
Rate), EFR (Enhanced Full Rate) and AMR (Adaptive
Multi Rate). FR provides 13 kbps, HR provides
5.6 kbps, EFR provides 12.2 kbps and AMR provides
from 4.75 to about 12.2 kbps.
6. References


[1] IEEE paper: "Speech Coding in Mobile Radio Communication," J. Gibson, August 1998
[2] Publication: 3GPP 51.010 specifications, 26.16.2
[3] Book: Speech Coding Methods, Standards and Applications, J. Gibson
[4] Book: Challenges in Speech Coding Research, J. D. Gibson
[5] Paper: "Speech Coding: A Tutorial Review," A. Spanias
[6] Paper: "Speech Coding II," J. Cernocky and V. Hubeika
[7] Book: Speech Coding Algorithms, W. C. Chu
[8] Book: Guide to Voice and Video over IP for Fixed and Mobile Networks, L. Sun, I.-H. Mkwawa, E. Jammeh and E. Ifeachor, 2013
[9] IEEE paper: "GSM Speech Coding and Speaker Recognition," L. Besacier, S. Grassi, A. Dufaux, M. Ansorge and F. Pellandini
[10] Book: Speech Compression and Coding Techniques, Chapter 2: Speech Compression
[11] Paper: "Digital Signal Processing and Filtering: GSM Codec," K. Lehtonen
