
Real-time VLSI Architecture for Video Compression

O. Fatemi and S. Panchanathan


Visual Computing and Communications Laboratory, Department of Electrical Engineering, University of Ottawa, Ottawa, Ontario, Canada, K1N 6N5. E-mail: o.fatemi@ieee.org. Fax: (613) 562-5175

Abstract- Video compression is becoming increasingly important in several applications. Vector quantization (VQ) is a powerful technique for very low bit rate image/video compression and is an attractive technique for mobile multimedia applications. Adaptive VQ techniques provide excellent coding performance at the expense of significant increases in computational complexity, making real-time implementation difficult. In this paper, we propose a VLSI chip-set design to implement a high-performance cache-based (adaptive) VQ (CVQ) technique using VHDL for real-time video compression.

I. INTRODUCTION
Vector quantization (VQ) is becoming increasingly popular in the domain of very low bit rate image/video compression. VQ is particularly attractive in applications such as mobile multimedia, video conferencing, video telephony, etc. In VQ [1], a set of representative images is decomposed into L-dimensional vectors. An iterative clustering algorithm such as the LBG algorithm is used to generate a codebook (CB) of size N. This codebook is then made available at both the transmitter and the receiver. In the encoding process, the image to be coded is decomposed into L-dimensional vectors. For each input vector $V_i$, CB is searched using a nearest neighbor rule to find the closest codeword $W_j$. Compression is achieved by transmitting the label $j$ corresponding to $W_j$. Reconstruction of images is implemented by using $j$ as an address to a table containing the codewords. The computational complexity of VQ for K input vectors of dimension L and a codebook of size N is O(KLN). For example, a 512 x 512 image with vector dimension L = 16 encoded using a codebook of size N = 256 requires approximately 192 million arithmetic operations. This high computational complexity has been an impediment for real-time implementation in many applications. Recently, special purpose architectures that implement VQ in real time have been reported in the literature [2].

Several adaptive VQ techniques which improve the coding performance have also been proposed in the literature. However, most adaptive techniques result in further increases in computational complexity, making real-time implementation difficult. Recently, adaptive VQ techniques [3] based on the cache concept have been reported. For example, a cache VQ technique (CVQ) has been presented in the literature [4], where two codebooks are used, namely a small primary codebook (PC) and a larger secondary codebook (SC). The frequently used codewords are stored in PC while the less frequently used codewords are stored in SC. To start with, both PC and SC are empty. For each input vector, PC is first searched for a match within a prespecified threshold (Figure 1). If no match is obtained, the input vector is transmitted and is also appended to PC as a new codeword. If a match is obtained, the index of the corresponding codeword is transmitted. When PC becomes full, the least recently used (LRU) codeword is moved from PC to SC, freeing room for the new codeword. From this point on, PC is searched first for a match; if that fails, SC is also searched. A new codeword is appended only if no match is obtained in both codebooks, and in this case the LRU codeword in PC is moved to SC. However, if SC is also full, the LRU codeword in SC is deleted. If a match is obtained in SC, the index of that codeword is transmitted and the codeword is swapped with the LRU codeword in PC. However, we note that this algorithm and other cache-based VQ algorithms cannot be directly mapped onto the existing architectures, since a software implementation of the LRU replacement algorithm may degrade the real-time performance of CVQ.
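As a quick check on the operation count quoted earlier in this section for conventional full-search VQ, the figure can be reproduced with a few lines of Python. The sketch assumes three arithmetic operations per vector element (subtract, multiply, accumulate); the text does not itemize the operations, so `ops_per_element` and the function name are illustrative assumptions only.

```python
def vq_search_ops(width, height, L, N, ops_per_element=3):
    """Arithmetic operations for a full codebook search over one image.

    ops_per_element=3 is an assumption (subtract, multiply, accumulate).
    """
    K = (width * height) // L      # number of L-dimensional input vectors
    return K * L * N * ops_per_element


ops = vq_search_ops(512, 512, L=16, N=256)
print(ops)             # total operation count, proportional to K*L*N
print(ops / 2**20)     # the same count in units of 2^20 operations
```

Under this assumption the count is 201,326,592 operations, i.e. 192 x 2^20, which matches the "approximately 192 million" quoted for the 512 x 512 example.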
In this paper, we propose a VLSI chip-set design which implements both the CVQ algorithm and the LRU algorithm in real time. The circuit has been built and tested using VHDL. The details of the architecture are discussed in Section II. In Section III, the VHDL implementation of the design is presented, followed by the conclusions in Section IV.

CCECE/CCGEI '95

0-7803-2766-7-9/95/$4.00 © 1995 IEEE

Figure 2 - The basic cell for SM.

Figure 1 - CVQ Algorithm
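The CVQ encoding procedure described in Section I can be sketched in software as follows. This is a minimal illustration, not the chip's implementation: the class and method names are hypothetical, the match test assumes a squared-error distortion with an exhaustive best-match search, labels are assumed to be running integers, and the LRU order is kept with an `OrderedDict` (for SC, ordered by the time a codeword was moved there, which is also an assumption).

```python
from collections import OrderedDict


def sq_dist(a, b):
    """Squared-error distortion between two vectors (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CVQEncoder:
    def __init__(self, pc_size, sc_size, threshold):
        self.pc = OrderedDict()    # primary codebook, least recently used first
        self.sc = OrderedDict()    # secondary codebook, least recently used first
        self.pc_size, self.sc_size = pc_size, sc_size
        self.threshold = threshold
        self.next_label = 0

    def _match(self, book, v):
        """Label of the closest codeword if it lies within the threshold, else None."""
        best = min(book.items(), key=lambda kv: sq_dist(kv[1], v), default=None)
        if best is not None and sq_dist(best[1], v) <= self.threshold:
            return best[0]
        return None

    def encode(self, v):
        v = tuple(v)
        label = self._match(self.pc, v)
        if label is not None:                      # hit in PC: send the index
            self.pc.move_to_end(label)
            return ("pc", label)
        label = self._match(self.sc, v)
        if label is not None:                      # hit in SC: swap with PC's LRU
            codeword = self.sc.pop(label)
            if len(self.pc) >= self.pc_size:
                lru_label, lru_cw = self.pc.popitem(last=False)
                self.sc[lru_label] = lru_cw
            self.pc[label] = codeword
            return ("sc", label)
        if len(self.pc) >= self.pc_size:           # miss in both codebooks
            lru_label, lru_cw = self.pc.popitem(last=False)
            if len(self.sc) >= self.sc_size:
                self.sc.popitem(last=False)        # SC full: delete its LRU codeword
            self.sc[lru_label] = lru_cw
        label, self.next_label = self.next_label, self.next_label + 1
        self.pc[label] = v                         # append the vector as a new codeword
        return ("new", v)                          # the vector itself is transmitted


enc = CVQEncoder(pc_size=2, sc_size=2, threshold=0)
results = [enc.encode(v) for v in [(0, 0), (1, 1), (0, 0), (2, 2), (1, 1), (0, 0)]]
print(results)
```

In the example, the fourth vector forces the LRU codeword of PC into SC, and the last two vectors are found in SC and swapped back into PC.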

II. CVQ ARCHITECTURE
The CVQ architecture consists of five main modules:
• Input delay module (IM)
• Systolic array module (SM)
• LRU module (LM)
• Comparator module (CM)
• Output delay module (OM)
The details of each module follow.

There are two SM modules in the CVQ design:
• SM1, with $L \times N_1$ cells, for the PC of size $N_1$
• SM2, with $L \times N_2$ cells, for the SC of size $N_2$
The block diagram of the SM with $L \times N$ basic cells is shown in Figure 3.

A. Systolic Array Module (SM)


This module is the most complex module of the design and utilizes most of the execution time and 97% of the chip area. The core of the module is a two-dimensional systolic array which calculates the distance between the input vector and all of the codewords in the codebook (Equation 1). The module exploits the existing parallelism in the direction of the codebook dimension, N, and in the direction of the vector dimension, L.

Figure 3 - The block diagram of SM

Equation 1

The basic cell in SM is shown in Figure 2. It calculates the distortion between an element of the input vector and the corresponding element of a codeword, and accumulates the distortion (Equation 2).

$$D(V_{i,j}, C_{p,j}) = D(V_{i,j-1}, C_{p,j-1}) + (V_{i,j} - C_{p,j})^2$$

Equation 2

The CLK signal synchronizes the operation of the cell. The element of the codeword, $C_{p,j}$, is stored in the RAM cell, C. The input value of the vector element is also sent to the output as $V_{out}$. The sequence of operations can be expressed as follows:
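A behavioral sketch of one basic cell is given below. It is only a software illustration under stated assumptions: the per-element distortion is taken to be the squared difference, the RAM is modeled in read mode only, and the names `SMCell`, `clock` and `row_distortion` are hypothetical.

```python
class SMCell:
    """Behavioral model of one SM basic cell (hypothetical names)."""

    def __init__(self, c):
        self.c = c            # codeword element held in the cell's RAM (read mode)
        self.v_out = None     # vector element forwarded to the next cell
        self.d_out = 0        # accumulated distortion leaving this cell

    def clock(self, v_in, d_in):
        """One CLK cycle: add this element's distortion and pass v_in through."""
        self.d_out = d_in + (v_in - self.c) ** 2   # assumed squared-error measure
        self.v_out = v_in
        return self.v_out, self.d_out


def row_distortion(codeword, vector):
    """Chain L cells along one codeword, ignoring the pipeline skew."""
    d = 0
    for cell, v in zip([SMCell(c) for c in codeword], vector):
        _, d = cell.clock(v, d)
    return d


print(row_distortion((4, 1, 3), (5, 1, 1)))   # 1 + 0 + 4 = 5
```

Each cell adds one term to the running sum, so the last cell of a row holds the cumulative distortion between the input vector and that codeword.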

The systolic array executes the encoding algorithm. Here, each input vector element, $V_{i,j}$ ($i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$), is compared with the corresponding element of each codeword, $C_{p,j}$ ($p = 1, 2, \ldots, N$), stored in the array (the RAMs are in read mode). We note that the elements of the vector are pumped into the array at intervals of one clock cycle. The distortion, $CD_{i,p}$, is the cumulative distortion value of the $i$th input vector compared with the $p$th codeword. This is then fed to a comparator module to select the best codeword. The CLK signal also serves to synchronize the operations of the cells in the array. In the first clock cycle, the input vector element $V_{1,1}$ is fed into the cell $C_{1,1}$. In the next clock cycle this element is fed into $C_{2,1}$, while the second element of the input vector, $V_{1,2}$, is fed to $C_{1,2}$, and so on. After $L$ clock cycles the last element of the input vector, $V_{1,L}$, is fed to the cell $C_{1,L}$. The element $V_{1,1}$ is compared with $C_{N,1}$ at the $N$th clock cycle. The sequence of operations is best understood from the cell occupancy diagram (Figure 4), which shows the cells occupied by the elements of the vector at different instants of time.


Figure 4 - The cell occupancy diagram
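The cell occupancy pattern can be reproduced with a small cycle-by-cycle software model of the array. The sketch below is illustrative only: it assumes a squared-error distortion, counts clock cycles from zero (so element j of vector i occupies cell (p, j) at cycle i + j + p), and models the comparator stage with a plain minimum search. The function name `systolic_encode` is hypothetical.

```python
def systolic_encode(vectors, codebook):
    """Cycle-level model of the L x N array; returns distortions and occupancy."""
    K, L, N = len(vectors), len(vectors[0]), len(codebook)
    # regs[p][j] holds what cell (p, j) latched on the previous cycle:
    # (vector index, element value, partial distortion), or None when idle.
    regs = [[None] * L for _ in range(N)]
    cd = [[None] * N for _ in range(K)]
    occupancy = {}                          # (cycle, p, j) -> vector index
    for t in range(K + L + N - 2):
        nxt = [[None] * L for _ in range(N)]
        for p in range(N):
            for j in range(L):
                if p == 0:
                    i = t - j               # input skew: element j is delayed j cycles
                    src = (i, vectors[i][j]) if 0 <= i < K else None
                else:
                    up = regs[p - 1][j]     # element passed on by the cell above
                    src = (up[0], up[1]) if up else None
                if src is None:
                    continue
                i, v = src
                d_in = regs[p][j - 1][2] if j > 0 else 0
                d = d_in + (v - codebook[p][j]) ** 2
                nxt[p][j] = (i, v, d)
                occupancy[(t, p, j)] = i
                if j == L - 1:              # last cell of the row: distortion complete
                    cd[i][p] = d
        regs = nxt
    return cd, occupancy


vectors = [(1, 2, 3), (4, 4, 4)]
codebook = [(1, 2, 3), (4, 5, 4), (0, 0, 0), (9, 9, 9)]
cd, occupancy = systolic_encode(vectors, codebook)
labels = [min(range(len(codebook)), key=lambda p: row[p]) for row in cd]
print(cd)
print(labels)
```

The distortions produced by the pipelined model equal those of a direct computation, and a new vector can enter the array on every clock cycle because each cell is busy with a different vector element at any instant.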

B. Comparator Module (CM)

This module consists of N label cells, which are connected in a pipeline. The design of the CM module and the comparator cell are shown in Figure 5 and Figure 6, respectively. Each cell consists of a comparator and a label register. The cell determines the label using the following algorithm.

If $CD_i \ge TD_{in}$
    $L_{out} = L_{in}$; $TD_{out} = TD_{in}$
else
    $L_{out} = i$; $TD_{out} = CD_i$
where $i$ is the label stored in the label register.

Figure 5 - The block diagram of CM

Figure 6 - The cell in the CM module

C. LRU Module (LM)

We recall from Section I that the least recently used codeword is selected to update the codebook in CVQ. We have implemented a simple and efficient design of the LRU approximation algorithm [5]. We note that the LRU approximation algorithm has a reduced complexity compared to the LRU algorithm, with minimal degradation in performance. Here, a usage bit $UB_i$ associated with each codeword $i$ and a removal pointer RP are used to find the LRU codeword. Whenever the $i$th codeword is referenced, its usage bit $UB_i$ is set to 1. To determine the index of the LRU codeword, the usage bits are examined in sequence. If $UB_i$ is 1, it is reset to 0. Otherwise ($UB_i = 0$), we know that codeword $i$ has not been referenced since the last time $UB_i$ was reset to 0. We note that this algorithm can be mapped onto a pipeline which generates the LRU label at every clock cycle. The schematic diagram of the LRU module is shown in Figure 7. The architecture comprises a chain of identical processing elements (PEs), a decoder, an encoder and a register to store the RP. Each PE consists of a flip-flop and some combinational logic. The implementation is modular and is easily expandable. Details of the design are presented in [5].

Figure 7 - The block diagram of LM

D. Input Delay Module (IM) and Output Delay Module (OM)

The IM schedules the vector elements to be input to the systolic array. This task is executed using delay buffer cells connected together as pipelines. Each cell passes its input to the output at every clock cycle. The first element of the input vector is fed to SM without any delay (Figure 8), while the second element is fed with a delay of one clock cycle, and so on. IM consists of $L(L-1)/2$ delay cells.

Figure 8 - The IM and the OM modules


The OM reorganizes the input vector so that all of its elements are output in one clock cycle. This module (Figure 8) is simply the mirror image of IM.
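The skewing performed by IM and the realignment performed by OM can be illustrated with simple delay lines. In this sketch the names `DelayLine`, `make_im` and `make_om` are hypothetical, and IM is connected directly to OM (bypassing the systolic array) purely to show that the mirror-image delays cancel the skew.

```python
from collections import deque


class DelayLine:
    """A pipeline of n delay-buffer cells; each cell passes its input on every clock."""

    def __init__(self, n):
        self.cells = deque([None] * n)

    def clock(self, x):
        if not self.cells:          # zero cells: the element passes straight through
            return x
        self.cells.append(x)
        return self.cells.popleft()


def make_im(L):
    return [DelayLine(j) for j in range(L)]            # element j is delayed j cycles


def make_om(L):
    return [DelayLine(L - 1 - j) for j in range(L)]    # mirror image of IM


L = 4
im, om = make_im(L), make_om(L)
vectors = [(10, 11, 12, 13), (20, 21, 22, 23)]
stream = vectors + [(None,) * L] * (L - 1)             # idle cycles flush the pipelines
skewed, aligned = [], []
for vec in stream:
    s = tuple(line.clock(x) for line, x in zip(im, vec))
    skewed.append(s)
    aligned.append(tuple(line.clock(x) for line, x in zip(om, s)))

print(sum(len(line.cells) for line in im))   # number of delay cells in IM
print(skewed[1])                             # elements staggered by one cycle each
print(aligned[L - 1])                        # first vector, realigned after L-1 cycles
```

For L = 4 the IM model contains 0 + 1 + 2 + 3 = 6 delay cells, in agreement with L(L-1)/2, and every element experiences the same total delay of L-1 cycles through IM and OM.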

III. VHDL Implementation

A behavioral VHDL description of the design has been implemented using the synthesizable part of the VHDL language. The implementation is based on general values for the codebook size, N, and the vector dimension, L. After an initial latency of $L + \max(N_1, N_2)$ clock cycles, the label of the first input vector becomes available. The labels for the subsequent input vectors are output at intervals of one clock cycle. The design has been synthesized (translated and then optimized) and tested. The resulting chip area and speed for the three basic cells are shown in Table 1. We note that area and speed can be improved by using advanced technology libraries.

IV. Conclusion

Vector quantization (VQ) is an excellent technique for very low bit rate image/video compression and is attractive for mobile multimedia applications. CVQ is a powerful adaptive VQ technique, and provides an excellent coding performance at a reduced complexity. In this paper, we have presented a VLSI chip-set design to implement the CVQ technique using VHDL for real-time video compression. A behavioral VHDL description of the design has been implemented using the synthesizable part of the VHDL language. Timing analysis demonstrates that this chip set is suitable for real-time video compression.

V. Acknowledgment
The authors would like to thank Mr. Robert Sawaya for his help in this project and the Ministry of Culture and Higher Education of the Islamic Republic of Iran for the financial support of this project.

VI. References
1) N. M. Nasrabadi and R. A. King, "Image Coding Using Vector Quantization: A Review", IEEE Trans. on Communications, Vol. COM-36, No. 8, pp. 957-971, August 1988.
2) G. A. Davidson, P. R. Cappello and A. Gersho, "Systolic Architectures for Vector Quantization", IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-36, pp. 1651-1664, October 1988.
3) S. Panchanathan and M. Goldberg, "A Mini-Max Algorithm for Image Adaptive Vector Quantization", IEE Proceedings: Part I - Communications, Speech and Vision, Vol. 138, No. 1, pp. 53-60, February 1991.
4) F. Idris and S. Panchanathan, "Image Sequence Coding Using Frame Adaptive Vector Quantization", Visual Communications and Image Processing '93, Vol. 2094, pp. 941-952, November 1993.
5) O. Fatemi, F. Idris and S. Panchanathan, "FPGA Implementation of the LRU Algorithm for Video Compression", IEEE Transactions on Consumer Electronics, Vol. 40, No. 3, pp. 337-344, August 1994.

Table 1

Cell          area*    delay (ns)
SM-cell                43.84
CM-cell                21.95
IM/OM-cell             1.39

* area is normalized to the equivalent of a nand2 gate


The minimum duration of the clock pulse is determined by the maximum of:
• The time taken by the SM-cell to compute the distortion (43.84 ns)
• The time taken by the CM-cell to complete the comparison and label extraction (21.95 ns)
• The time taken by the IM-cell or OM-cell to pass the input to the output (1.39 ns)
Hence, the maximum frequency of operation is $f = 1/(43.84\ \text{ns}) \approx 23$ MHz, which is suitable for real-time video compression. Typical values for a video sequence with 512 x 512 images at the rate of 30 images per second are $L = 16$, $N_1 = 8$ and $N_2 = 32$, which gives $K = 16384$ vectors per image. The total number of vectors which should be encoded within one second is 491520. This requirement is met very well by the implemented chip set, which is capable of encoding this number of vectors in only 0.0215 second. The total area of the chip is $845 \cdot (L \cdot N_t) + 203 \cdot N_t + 56 \cdot L$, where $N_t = N_1 + N_2$.
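The timing figures above can be reproduced with a short calculation. The sketch assumes one label per clock cycle in steady state, as stated in the VHDL implementation section, and neglects the initial pipeline latency; the variable names are illustrative.

```python
# Cell delays in nanoseconds, as reported for the three basic cells.
cell_delay_ns = {"SM": 43.84, "CM": 21.95, "IM/OM": 1.39}

clock_ns = max(cell_delay_ns.values())        # the slowest cell sets the clock period
f_max_mhz = 1e3 / clock_ns                    # maximum clock frequency in MHz

L, width, height, fps = 16, 512, 512, 30
K = width * height // L                       # vectors per image
vectors_per_second = K * fps

# One vector is encoded per clock cycle once the pipeline is full
# (the initial latency is neglected here).
encode_time_s = vectors_per_second * clock_ns * 1e-9

print(round(f_max_mhz, 2), K, vectors_per_second, round(encode_time_s, 4))
```

The clock period is set by the SM-cell, giving roughly 22.8 MHz, and one second of video (491520 vectors) is encoded in about 0.0215 s, so the chip set has a large timing margin at this resolution and frame rate.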
