
MPEG-4 CELP

PRATIK PANCHAL (08BEC054) Department of Electronics and Communication, Nirma University.

Introduction
Speech coding has been a common area of research in signal processing since the introduction of wire-based telephones. Numerous speech coding techniques have been thoroughly researched and developed, spurred further by advances in the internet, wireless communication, and computing technology. Speech coding is a fundamental element of digital communications, and it continues to attract attention as the demand for telecommunication services and capabilities grows. Speech coders have improved at a very fast pace over the years, taking advantage of the increasing capabilities of communication infrastructure and computer hardware.

CELP (Code-Excited Linear Prediction) is a speech coding algorithm used in MPEG-4. It is a lossy compression algorithm for low-bit-rate (bits/second) speech coding. The CELP algorithm is based on four main ideas:

- Using the source-filter model of speech production through linear prediction (LP).
- Using an adaptive and a fixed codebook as the input (excitation) of the LP model.
- Performing a closed-loop search in a perceptually weighted domain.
- Applying vector quantization (VQ).

CELP in Detail
Linear Predictive Coding Model Physical model:

When you speak:

- Air is pushed from your lungs through your vocal tract, and out of your mouth comes speech.
- For voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice: women and young children tend to have high pitch (fast vibration), while adult males tend to have low pitch (slow vibration).
- For fricative and plosive (unvoiced) sounds, your vocal cords do not vibrate but remain open.
- The shape of your vocal tract determines the sound that you make. As you speak, your vocal tract changes shape, producing different sounds. The shape changes relatively slowly (on the scale of 10 msec to 100 msec).
- The amount of air coming from your lungs determines the loudness of your voice.

Mathematical Model:

The model says that the digital speech signal is the output of a digital filter (the LPC filter) whose input is either a train of impulses or a white noise sequence. The relationship between the physical and mathematical models is:

    Physical model                 Mathematical model
    Vocal tract                    H(z) (LPC filter)
    Air                            u(n) (innovations)
    Vocal cord vibration           V (voiced)
    Vocal cord vibration period    T (pitch period)
    Fricatives and plosives        UV (unvoiced)
    Air volume                     G (gain)
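The source-filter model can be sketched in a few lines of code: pick an excitation (impulse train for voiced, white noise for unvoiced) and run it through an all-pole filter. The filter coefficients, pitch period, and gain below are made-up illustrative values, not taken from any real coder:

```python
import numpy as np

def lpc_synthesize(a, excitation, gain=1.0):
    """Run an excitation through the all-pole LPC filter:
    s(n) = sum_k a[k] * s(n-k) + G * u(n)."""
    order = len(a)
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        past = sum(a[k] * s[n - 1 - k] for k in range(order) if n - 1 - k >= 0)
        s[n] = past + gain * excitation[n]
    return s

# Toy 2nd-order "vocal tract" filter (illustrative, stable coefficients).
a = [1.3, -0.9]

# Voiced source: impulse train with pitch period T = 50 samples.
T = 50
u_voiced = np.zeros(400)
u_voiced[::T] = 1.0

# Unvoiced source: white noise with the vocal cords held open.
rng = np.random.default_rng(0)
u_unvoiced = rng.standard_normal(400)

voiced = lpc_synthesize(a, u_voiced)
unvoiced = lpc_synthesize(a, u_unvoiced)
```

Both outputs share the same spectral envelope (set by the filter); only the excitation differs, which is exactly the V/UV distinction in the table above.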

LPC synthesis:
The LPC (all-pole) synthesis filter is given by:

    H(z) = 1 / A(z) = 1 / (1 - Σ_{k=1}^{10} a_k z^{-k})

The equivalent input-output difference equation is:

    s(n) = Σ_{k=1}^{10} a_k s(n-k) + G u(n)

Digital speech signals are sampled at a rate of 8000 samples/sec, and the model parameters change only every 20 msec or so. At 8000 samples/sec, 20 msec corresponds to 160 samples, so the digital speech signal is divided into frames of 160 samples (50 frames/second). For each frame, the model can be represented in vector form as:

    A = [a_1, a_2, ..., a_10, G, V/UV, T]

So the 160 values of s in a frame are compactly represented by the 13 values of A. There is almost no perceptual difference in s if:

- For voiced sounds (V): the impulse train is shifted (the ear is insensitive to a phase change).
- For unvoiced sounds (UV): a different white noise sequence is used.

CELP ENCODER
The main principle behind CELP is called Analysis-by-Synthesis (AbS) and means that the encoding (analysis) is performed by perceptually optimizing the decoded (synthesis) signal in a closed loop. In theory, the best CELP stream would be produced by trying all

possible bit combinations and selecting the one that produces the best-sounding decoded signal. This is obviously not possible in practice for two reasons: the required complexity is beyond any currently available hardware and the "best sounding" selection criterion implies a human listener. In order to achieve real-time encoding using limited computing resources, the CELP search is broken down into smaller, more manageable, sequential searches using a simple perceptual weighting function. Typically, the encoding is performed in the following order:

1. Linear prediction coefficients (LPC) are computed and quantized, usually as LSPs.
2. The adaptive (pitch) codebook is searched and its contribution removed.
3. The fixed (innovation) codebook is searched.
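The fixed-codebook search above can be sketched as a brute-force analysis-by-synthesis loop: each candidate codevector is passed through the synthesis filter and the one minimizing the error against the target is kept. The codebook, filter, and subframe size are toy values, and the error here is unweighted for brevity (a real coder minimizes it in the perceptually weighted domain):

```python
import numpy as np

def synthesize(a, u):
    """All-pole filtering: s(n) = u(n) + sum_k a[k]*s(n-k)."""
    s = np.zeros(len(u))
    for n in range(len(u)):
        s[n] = u[n] + sum(a[k] * s[n - 1 - k]
                          for k in range(len(a)) if n - 1 - k >= 0)
    return s

def search_codebook(codebook, a, target):
    """Analysis-by-synthesis: try every codevector (with its optimal
    gain) and keep the one whose synthesized output is closest to target."""
    best = (None, None, np.inf)
    for i, c in enumerate(codebook):
        y = synthesize(a, c)
        g = np.dot(target, y) / np.dot(y, y)   # optimal gain for this entry
        e = np.sum((target - g * y) ** 2)
        if e < best[2]:
            best = (i, g, e)
    return best

rng = np.random.default_rng(2)
a = [1.3, -0.9]                            # toy synthesis filter
codebook = rng.standard_normal((32, 40))   # 32 entries, 40-sample subframe

# Build a target that really is codevector 7 scaled by 0.5, plus noise.
target = 0.5 * synthesize(a, codebook[7]) + 0.01 * rng.standard_normal(40)

idx, gain, err = search_codebook(codebook, a, target)
```

The search correctly recovers index 7 with a gain near 0.5, which is the "best-sounding" selection criterion reduced to a tractable least-squares one.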

Block Diagram of CELP Encoder (4.8 kbps)

The pitch prediction filter is given by:

    P(z) = 1 / (1 - β z^{-T})

where T is the pitch period (it can be an integer or a fraction of a sample) and β is the pitch gain.

Most modern audio codecs attempt to shape the coding noise so that it falls mostly in the frequency regions where the ear cannot detect it. For example, the ear is more tolerant of noise in the parts of the spectrum that are louder, and vice versa. That is why, instead of minimizing a simple quadratic error, CELP minimizes the error in a perceptually weighted domain. The perceptual weighting filter is:

    W(z) = A(z/γ1) / A(z/γ2)

A good choice is γ1 = 0.9 and γ2 = 0.6.
Each frame is divided into 4 subframes. In each subframe, each codebook contains 512 code vectors (a 9-bit index), and the gain is quantized using 5 bits per subframe. At 30 msec per frame, 4.8 kbps is equivalent to 144 bits/frame, since 4800 bits/s × 0.030 s = 144 bits.
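The 144-bit figure is just the product of rate and frame length. One way such a budget might be split is sketched below; the excitation figures follow from the numbers given above, but the split of the remainder between LPC/LSP parameters and overhead is only an assumption, not the actual allocation of any specific 4.8 kbps coder:

```python
bitrate = 4800                            # bits per second
frame_len = 0.030                         # seconds per frame
frame_bits = round(bitrate * frame_len)   # 144 bits per frame

subframes = 4
index_bits = 9                            # 512 code vectors -> log2(512) = 9
gain_bits = 5                             # per subframe, as stated above

# Illustrative split (assumption): both codebooks send an index and a
# gain per subframe; whatever is left covers LPC/LSP parameters etc.
excitation_bits = subframes * 2 * (index_bits + gain_bits)
remaining_for_lpc = frame_bits - excitation_bits
```

Under these assumptions the excitation consumes 112 bits, leaving 32 bits per frame for the spectral parameters.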

CELP DECODER

Figure 1 describes a generic CELP decoder. The excitation is produced by summing the contributions from an adaptive (a.k.a. pitch) codebook and a stochastic (a.k.a. innovation or fixed) codebook:

    e[n] = ea[n] + ef[n]

where ea[n] is the adaptive (pitch) codebook contribution and ef[n] is the stochastic (innovation or fixed) codebook contribution. The fixed codebook is a vector quantization dictionary that is (implicitly or explicitly) hard-coded into the codec. This codebook can be algebraic (ACELP) or stored explicitly (e.g. Speex). The entries in the adaptive codebook consist of delayed versions of the excitation, which makes it possible to efficiently code periodic signals such as voiced sounds. The filter that shapes the excitation is an all-pole model of the form 1/A(z), where A(z) is called the prediction filter and is obtained using linear prediction (the Levinson-Durbin algorithm). An all-pole filter is used because it is a good representation of the human vocal tract and because it is easy to compute.
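A single decoder subframe following the equation above might look like this. All sizes and coefficients are illustrative, and the sketch assumes T is at least the subframe length; a real decoder also interpolates LSPs, handles fractional pitch lags, and so on:

```python
import numpy as np

def synth_filter(a, e):
    """All-pole synthesis 1/A(z): s(n) = e(n) + sum_k a[k]*s(n-k)."""
    s = np.zeros(len(e))
    for n in range(len(e)):
        s[n] = e[n] + sum(a[k] * s[n - 1 - k]
                          for k in range(len(a)) if n - 1 - k >= 0)
    return s

def decode_subframe(exc_hist, T, beta, c, g, a):
    """e[n] = ea[n] + ef[n]: the adaptive part is beta times the past
    excitation delayed by T, the fixed part is g times codevector c
    (assumes T >= len(c))."""
    N = len(c)
    start = len(exc_hist) - T
    ea = beta * exc_hist[start:start + N]   # adaptive (pitch) codebook
    ef = g * c                              # fixed (innovation) codebook
    e = ea + ef
    return synth_filter(a, e), np.concatenate([exc_hist, e])

rng = np.random.default_rng(3)
a = [1.3, -0.9]                       # illustrative LPC coefficients
exc_hist = rng.standard_normal(200)   # past excitation (decoder state)
c = rng.standard_normal(40)           # received fixed-codebook vector

speech, exc_hist = decode_subframe(exc_hist, T=60, beta=0.8, c=c, g=1.2, a=a)
```

Note how the new excitation is appended to the history: that is what makes the adaptive codebook "adaptive", since future subframes draw their delayed copies from it.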

Pitch Prediction
During voiced segments, the speech signal is periodic, so it is possible to take advantage of that property by approximating the excitation signal e[n] by a gain times its own past:

    e[n] ≈ p[n] = β e[n - T]

where T is the pitch period and β is the pitch gain, with T typically larger than N, the subframe length.
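On the encoder side, T and β can be estimated from the past excitation by maximizing the normalized correlation. A minimal open-loop sketch on a synthetic periodic excitation (real coders refine the lag in closed loop and allow fractional values of T):

```python
import numpy as np

def estimate_pitch(e, N, t_min, t_max):
    """Pick the lag T maximizing the normalized correlation between the
    current subframe e[-N:] and its past e[-N-T:-T]; beta is the
    least-squares optimal gain for that lag."""
    cur = e[-N:]
    best_T, best_score = t_min, -np.inf
    for T in range(t_min, t_max + 1):
        past = e[-N - T:-T]
        score = np.dot(cur, past) / np.sqrt(np.dot(past, past))
        if score > best_score:
            best_T, best_score = T, score
    past = e[-N - best_T:-best_T]
    beta = np.dot(cur, past) / np.dot(past, past)
    return best_T, beta

# Synthetic voiced excitation: impulse train with period 40 samples.
e = np.zeros(400)
e[::40] = 1.0

T, beta = estimate_pitch(e, N=80, t_min=20, t_max=147)
```

On this ideal signal the search recovers T = 40 with β = 1; on real speech the correlation peak is broader and β varies with how strongly voiced the segment is.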

Innovation Codebook
The final excitation e[n] is the sum of the pitch prediction p[n] and the innovation signal c[n] from the fixed codebook:

    e[n] = p[n] + c[n]

The decoded speech signal is then obtained by filtering this excitation through the synthesis filter 1/A(z).

Applications

CELP has been examined extensively for the coding of speech signals. The methodology allows a joint optimization of waveform selection, waveform scaling, and pitch filter determination. Methods exist to accommodate high-pitched speakers (pitch lag smaller than the analysis frame size). The synthesis parameters can be coded into a bit stream at 4.8 kb/s, and coders have been tested at channel error rates of 0.001 with only minor degradation in the resulting speech. An adaptive postfilter can be added to achieve a small increase in perceived speech quality. CELP is used in GSM/CDMA-based handsets for speech coding. Because CDMA makes it easy to convey the information stream over a variable-rate (VR) physical channel, the fixed-rate constraint has been removed from the speech coding algorithm design in order to exploit the time-varying local character of speech.

References

http://en.wikipedia.org/wiki/Code-excited_linear_prediction
http://www.speex.org/docs/manual/speex-manual/node9.html
http://www.data-compression.com/speech.html
