This chapter describes several basic vocoders, in part, for historical back-
ground, but primarily for the purpose of introducing basic concepts that
have been incorporated into subsequent, more complex coders.
In 1939, Homer Dudley published a description of the first vocoder
[33] in the Bell Labs Record. The term vocoder was derived from VOice
CODER. The vocoder was conceived for efficient transmission of speech
signals over expensive long-distance telephone circuits. Vocoders com-
press the speech signal and have evolved to become more efficient over
the years. Dudley is credited with being the first to show that speech
signals can be transmitted over a fraction of the bandwidth occupied by
the original signal when properly coded. Dudley’s vocoder, a type of
channel vocoder, was the first device to realize the promised economy
[143]. The fact that increased economy can be achieved with a vocoder
implies that much of the actual speech signal is redundant.
Many vocoders are based on the source-filter speech model (see Section
2.3). This approach models the vocal tract as a slowly varying linear
filter. The filter is excited by either glottal pulses (modeled as a periodic
signal), turbulence (modeled as white noise), or a combination of the
two. Similarities exist between the source-filter model and actual speech
articulation. The source excitation represents the stream of air blown
through the vocal tract, and the linear filter models the vocal tract.
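The source-filter model above can be sketched in a few lines of code. The helper below is illustrative only (its name and interface are not from the text): it drives an all-pole filter, standing in for the vocal tract, with an impulse-train excitation standing in for the glottal pulses.

```python
import numpy as np

def source_filter_synthesize(excitation, lpc_coeffs, gain=1.0):
    """Pass an excitation through an all-pole (vocal-tract) filter:
    s(n) = gain * e(n) + sum_k a_k * s(n - k).
    Hypothetical helper for illustration; not from the text."""
    a = np.asarray(lpc_coeffs, dtype=float)
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        past = sum(a[k] * s[n - 1 - k]
                   for k in range(len(a)) if n - 1 - k >= 0)
        s[n] = gain * excitation[n] + past
    return s

# Voiced excitation: impulse train at the pitch period
# (80 samples at 8 kHz corresponds to a 100 Hz pitch).
pitch_period = 80
excitation = np.zeros(400)
excitation[::pitch_period] = 1.0

# A stable one-pole "vocal tract" chosen only for illustration.
speech = source_filter_synthesize(excitation, [0.9])
```

Swapping the impulse train for white noise models the turbulent (unvoiced) excitation in the same framework.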
Unlike waveform coders, which attempt to reconstruct accurate rep-
resentations of the time-domain waveform, vocoders reproduce a signal
that is perceptually similar. It is well established that the human au-
ditory system performs a short-time frequency transform on acoustic
signals prior to neural transduction and perception [38]. Exact preser-
vation of the time waveform is not necessary for perceptually accurate
signal representation.
FIGURE 9.1
Channel vocoder analysis of input speech [137].
The voicing information is derived from the pitch estimator and parameterized as a cutoff frequency. Frequencies below the cutoff are considered voiced and harmonic; frequencies above the cutoff are considered unvoiced and nonharmonic, and are synthesized with random phase.
The decoder decodes the pitch and voicing information and inverse
transforms the cepstral information. This information is combined to
restore the amplitudes, frequencies, and phases for the component sine
waves. The sine waves are synthesized to produce the output speech.
The harmonic components (those below the voicing cutoff frequency) are
synthesized to be in phase with the fundamental pitch frequency. Those
components that are unvoiced (above the voicing cutoff frequency) are
synthesized with a random phase.
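The synthesis rule above can be sketched as a sum of sinusoids, with the voicing cutoff deciding each harmonic's phase. The function name and flat amplitude envelope below are illustrative assumptions, not details from the text.

```python
import numpy as np

def synthesize_frame(f0, amplitudes, cutoff_hz, fs, n_samples, rng=None):
    """Sum harmonics of f0: phase-locked to the fundamental below the
    voicing cutoff, random phase above it. Illustrative sketch only."""
    rng = np.random.default_rng(0) if rng is None else rng
    t = np.arange(n_samples) / fs
    out = np.zeros(n_samples)
    for k, amp in enumerate(amplitudes, start=1):
        freq = k * f0
        if freq >= fs / 2:          # stay below the Nyquist frequency
            break
        # Harmonic components are coherent; unvoiced components are not.
        phase = 0.0 if freq <= cutoff_hz else rng.uniform(0, 2 * np.pi)
        out += amp * np.cos(2 * np.pi * freq * t + phase)
    return out

# One 20 ms frame at 8 kHz with a 100 Hz pitch and a 1 kHz voicing cutoff.
frame = synthesize_frame(f0=100.0, amplitudes=[1.0] * 30,
                         cutoff_hz=1000.0, fs=8000.0, n_samples=160)
```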
Reference [111] contains additional information on improvements to
coding efficiency and robustness to errors for the sinusoidal coder. These
improvements incorporate vector quantization of subband channel ener-
gies.
phases of the FFT of the residual. At the decoder, this base-band resid-
ual is replicated at higher frequency bands to generate a full bandwidth
residual. This synthesized residual is then passed through the LPC filter
to synthesize the speech.
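The high-band regeneration step can be illustrated by tiling the base-band residual spectrum out to full bandwidth. This is a simplified stand-in for the decoder's replication stage; the function name and the bin-tiling scheme are assumptions for illustration.

```python
import numpy as np

def replicate_baseband(baseband_spectrum, n_full_bins):
    """Tile a base-band residual spectrum up to full bandwidth.
    Simplified sketch of high-band regeneration; illustrative only."""
    reps = int(np.ceil(n_full_bins / len(baseband_spectrum)))
    return np.tile(baseband_spectrum, reps)[:n_full_bins]

# Four transmitted base-band magnitude bins, replicated to ten full-band bins.
base = np.array([1.0, 0.5, 0.25, 0.125])
full = replicate_baseband(base, 10)
```

In a complete decoder the replicated spectrum would be inverse transformed and passed through the LPC filter, as the text describes.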
The LPC coder based on the classical two-state voicing decision results
in the lowest bit rate for the residual excitation. The conceptual basis
behind this method is the simple production model of Figure 2.14. For
each frame, the speech is classified as either voiced or unvoiced and, as
such, requires only one bit for coding the voicing.
Figure 9.7 displays a block diagram for the LPC encoder. The input
speech is sampled and segmented into frames. For each frame, an LP
analysis is performed to represent the spectral envelope. Typical LP
orders range from 10 to 14 depending on the bandwidth of the speech
and bit-rate limits. The pitch period is estimated by a separate algo-
rithm. The voiced/unvoiced decision can utilize information about the
harmonic structure (or lack thereof) in the spectrum from the pitch estimation, or the decision can be based on simple parameters such as the
energy of the frame and number of zero crossings of the time waveform
in the frame. Voiced speech tends to have higher energy and a lower
number of zero crossings than unvoiced speech.
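A toy two-state decision based on these two parameters might look as follows. The thresholds are illustrative placeholders, not values from the text; a real coder would tune them to the input level and noise conditions.

```python
import numpy as np

def is_voiced(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Toy voiced/unvoiced decision from frame energy and zero-crossing
    rate. Thresholds are illustrative assumptions, not from the text."""
    frame = np.asarray(frame, dtype=float)
    energy = np.mean(frame ** 2)
    # Fraction of sample pairs whose sign changes (zero-crossing rate).
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    # Voiced speech: high energy, few zero crossings.
    return bool(energy > energy_thresh and zcr < zcr_thresh)

fs = 8000
t = np.arange(160) / fs
voiced_like = 0.5 * np.sin(2 * np.pi * 100 * t)   # low-frequency, high energy
rng = np.random.default_rng(1)
unvoiced_like = 0.05 * rng.standard_normal(160)   # noisy, low energy
```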
The LPC decoder of Figure 9.8 shows an implementation of the source-
filter speech production model. The pulse generator produces a periodic
waveform at intervals of the pitch period. The noise generator outputs a
random sequence of white (equal power across the spectrum) noise. The
voicing information controls the switch that decides whether to select the
periodic (voiced) or random (unvoiced) excitation signal. The excitation
signal is then frequency shaped by the LPC filter and multiplied by the
gain to produce the correct energy (signal amplitude) of the output
synthesized speech.
FIGURE 9.8
LPC decoder.
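The excitation switch of the decoder can be sketched directly: a pulse generator for voiced frames, a white-noise generator for unvoiced frames. The helper name and interface are assumptions for illustration.

```python
import numpy as np

def make_excitation(voiced, pitch_period, n_samples, rng=None):
    """Excitation switch of the LPC decoder: periodic pulses when voiced,
    white noise when unvoiced. Illustrative sketch only."""
    if voiced:
        e = np.zeros(n_samples)
        e[::pitch_period] = 1.0    # one pulse per pitch period
        return e
    rng = np.random.default_rng(0) if rng is None else rng
    return rng.standard_normal(n_samples)

voiced_exc = make_excitation(True, pitch_period=64, n_samples=160)
unvoiced_exc = make_excitation(False, pitch_period=64, n_samples=160)
```

Either excitation would then be scaled by the gain and filtered by the LPC synthesis filter, as in Figure 9.8.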
The average magnitude difference function (AMDF) is computed as

AMDF(m) = sum from n = 0 to N-1 of |s(n) - s(n+m)|

for the speech samples, s(n), over the window of length N. For periodic
speech, the AMDF will have a valley for the values of m near the pitch
period. The range of allowable pitch frequencies is limited to 50 to
400 Hz. The AMDF is similar to the autocorrelation method of pitch
estimation but is simpler computationally, using differences and absolute
magnitudes instead of multiplications. Six bits are used to code the pitch
value.
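An AMDF-based pitch search over the stated 50 to 400 Hz range can be sketched as below. The function name is illustrative; note that, as the text describes, only differences and absolute values are needed, no multiplications.

```python
import numpy as np

def amdf_pitch(frame, fs, f_min=50.0, f_max=400.0):
    """Estimate the pitch period as the lag m minimizing the AMDF,
    mean(|s(n) - s(n+m)|), over the allowed range. Illustrative sketch."""
    frame = np.asarray(frame, dtype=float)
    m_lo = int(fs / f_max)    # shortest allowed period
    m_hi = int(fs / f_min)    # longest allowed period
    best_m, best_val = m_lo, np.inf
    for m in range(m_lo, m_hi + 1):
        d = np.mean(np.abs(frame[:-m] - frame[m:]))
        if d < best_val:
            best_m, best_val = m, d
    return best_m

# A 100 Hz sinusoid at 8 kHz has a true period of 80 samples; the AMDF
# valley may also land on the doubled period (160 samples).
fs = 8000
t = np.arange(640) / fs
period_est = amdf_pitch(np.sin(2 * np.pi * 100 * t), fs)
```

The ambiguity between the true and doubled period in this example is exactly the halving/doubling error that the dynamic programming tracker, discussed below, is designed to remove.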
For voicing, two decisions are made for each frame, one at the be-
ginning and another at the end of the frame. The decision takes into
account the low frequency energy, the zero-crossing count, the first two
reflection coefficients, and the ratio of the AMDF maximum to mini-
mum. The first reflection coefficient provides a measure of spectral tilt.
For voiced speech, the lower frequency components are of greater magni-
tude than the higher, resulting in a significant spectral tilt. The second
reflection coefficient is computed from lowpass (800 Hz) filtered speech
and is a measure of spectral peakedness. A strong peak character in the
low frequency spectrum is an indicator of voicing. The voicing decision
is performed by a linear discriminant classifier. The classifier adapts
to different input noise levels by estimating the input SNR and using
different coefficients in the linear combination of terms. Reference [15]
details the voicing classifier.
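A linear discriminant of this kind reduces to a weighted sum of the features compared against a threshold. The weights, bias, and feature values below are invented for illustration; they are not the trained coefficients of [15].

```python
import numpy as np

def linear_discriminant_voicing(features, weights, bias):
    """Declare the frame voiced if w . x + b > 0. In the coder described,
    the weights would be selected per estimated input SNR; these values
    are illustrative placeholders only."""
    return float(np.dot(weights, features) + bias) > 0.0

# Hypothetical feature vector: [low-band energy, zero-crossing count,
# first reflection coeff, second reflection coeff, AMDF max/min ratio].
w = np.array([2.0, -0.5, -1.5, 1.0, 0.8])
b = -1.0
voiced_features = np.array([1.2, 0.1, -0.9, 0.6, 2.5])
```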
Both the pitch estimates and the voicing decisions are adjusted with
dynamic programming. For the pitch values, the dynamic programming
tracking removes occurrences of pitch halving or doubling errors. The
dynamic programming reduces spurious switching between voicing deci-
sions.
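A minimal sketch of such a tracker, assuming a Viterbi-style search with a cost that penalizes large relative pitch jumps (the exact cost function of the coder is not given in the text):

```python
import numpy as np

def smooth_pitch_track(candidates, costs, jump_penalty=0.5):
    """Dynamic-programming smoothing over per-frame pitch candidates.
    Penalizing octave-sized jumps suppresses halving/doubling errors.
    Illustrative sketch; not the coder's exact cost function."""
    n_frames, n_cand = len(candidates), len(candidates[0])
    D = np.array(costs[0], dtype=float)              # cumulative cost
    back = np.zeros((n_frames, n_cand), dtype=int)   # best predecessors
    for t in range(1, n_frames):
        new_D = np.empty(n_cand)
        for j in range(n_cand):
            trans = [D[i] + jump_penalty *
                     abs(np.log2(candidates[t][j] / candidates[t - 1][i]))
                     for i in range(n_cand)]
            back[t][j] = int(np.argmin(trans))
            new_D[j] = trans[back[t][j]] + costs[t][j]
        D = new_D
    path = [int(np.argmin(D))]                       # trace back best path
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return [candidates[t][j] for t, j in enumerate(reversed(path))]

# Frame 2's local costs slightly favor the doubled pitch (200 Hz), but
# the jump penalty keeps the smoothed track at 100 Hz.
cands = [[100.0, 200.0]] * 4
local = [[0.0, 1.0], [0.0, 1.0], [0.2, 0.1], [0.0, 1.0]]
track = smooth_pitch_track(cands, local)
```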
The LPC analysis is tenth order and applies the covariance method to
estimate the parameters. A Cholesky decomposition (see Section 4.2.2)
solves the system of equations. The LPC coefficients are represented as