
HIGH THROUGHPUT MULTISTANDARD TRANSFORM CORE
REALISATION USING CSDA IN VERILOG HDL
A PROJECT REPORT
Submitted by

SELVARANI. K (951911106080)

SUJITHA. M (951911106093)

SWARNAMUKI. R (951911106096)

in partial fulfillment for the award of the degree


of

BACHELOR OF ENGINEERING
in
ELECTRONICS AND COMMUNICATION ENGINEERING
P.S.R.ENGINEERING COLLEGE, SIVAKASI-626140

ANNA UNIVERSITY: CHENNAI 600 025

APRIL 2015


BONAFIDE CERTIFICATE
Certified that this project report "HIGH THROUGHPUT MULTISTANDARD TRANSFORM CORE REALISATION USING CSDA IN VERILOG HDL" is the bonafide work of SELVARANI.K (951911106080), SUJITHA.M (951911106093) and SWARNAMUKI.R (951911106096), who carried out the project under my supervision.

SIGNATURE
C.K.RAMAR, M.E.,
HEAD OF THE DEPARTMENT
Electronics and Communication Engineering
P.S.R. Engineering College,
Sivakasi-626 140

SIGNATURE
Mrs.J.MEENA, M.E.,
SUPERVISOR
Assistant Professor
Electronics and Communication Engineering
P.S.R. Engineering College,
Sivakasi-626 140

Submitted for the Viva Voce to be held on: _______________

INTERNAL EXAMINER

EXTERNAL EXAMINER


ACKNOWLEDGEMENT
First and foremost, we wish to express our deep gratitude to our institution and our department for providing us a chance to fulfill our long-cherished dream of becoming Electronics and Communication Engineers.
We thank our beloved correspondent Mr.R.Solaisamy for his support, and every staff member of our college for their contribution to the growth of this project.
We wish to express our hearty thanks to the principal of our college, Dr.B.G.Vishnuram, M.E.,Ph.D.,FIE., for his constant motivation and continual encouragement regarding our project work.
We are greatly indebted to our Head of Department, Mr.C.K.Ramar, M.E., for his sincere help and the encouragement he has given towards the accomplishment of this project work.
We express our warm and sincere thanks to our guide, Mrs.J.Meena, M.E., Assistant Professor, Electronics and Communication Engineering, for her tireless and meticulous efforts in bringing this project to its logical conclusion.
We are committed to place our heartfelt thanks to all teaching and non-teaching staff members, lab technicians and friends, and all the noble hearts who gave us immense encouragement towards the completion of our project.


ABSTRACT
Compressing video signals with very small delay and small area in an efficient manner is a challenging task in VLSI design using Verilog HDL. This project proposes an architecture for compressing video signals that supports three different video codec standards, MPEG-1/2/4 (8×8), H.264 (8×8, 4×4) and VC-1 (8×8, 8×4, 4×8, 4×4), through a single core. The compression technique involves transformation, truncation and encoding. Combining factor sharing and distributed arithmetic yields a new scheme called the Common Sharing Distributed Arithmetic (CSDA) algorithm. This efficiently reduces the number of adders used in the proposed Multistandard Transform (MST) core, in place of multipliers, and also introduces an Error-Compensated Adder-Tree (ECAT), which efficiently reduces the truncation errors arising from CSDA when compared with a direct DCT implementation. The proposed system uses a buffer for storage instead of pipeline registers; with a transpose memory (TMEM), the transpose of the intermediate matrix can be formed, which realizes the 2-D CSDA from the 1-D CSDA. The MST core can thus be constructed at low cost, with low area, delay and power. Extensive simulations are conducted in the ModelSim simulator to evaluate the notable performance figures, and the entire schematic is viewed using the RTL schematic in the Xilinx software.


TABLE OF CONTENTS

CHAPTER NO  TITLE                                                    PAGE NO

            ABSTRACT                                                 iv
            LIST OF FIGURES
            LIST OF TABLES                                           xi
            LIST OF ABBREVIATIONS                                    xii

1           INTRODUCTION
            1.1  Video Compression
            1.2  Lossy Compression
            1.3  Advantages of Video Compression
            1.4  Developments in the Field of VLSI
                 1.4.1  Reconfigurable computing
                 1.4.2  Takeover of hardware design
                 1.4.3  The need for hardware compilers
            1.5  Design Methodology
            1.6  Objective and Scope
            1.7  Applications
            1.8  Recent Research in Video Compression

2           VIDEO CODECS                                             11
            2.1  Video Codec Design                                  11
            2.2  Different Standards                                 14
                 2.2.1  MPEG 1/2/4                                   14
                        2.2.1.1  MPEG-1                              14
                        2.2.1.2  MPEG-2                              15
                        2.2.1.3  MPEG-4                              15
                 2.2.2  H.264                                        16
                 2.2.3  VC-1                                         17

3           SYSTEM ANALYSIS                                          18
            3.1  Project Introduction                                18
            3.2  Existing System                                     19
                 3.2.1  Introduction                                 19
                 3.2.2  CSDA                                         20
                 3.2.3  Limitations of the existing system           20
            3.3  Proposed System                                     20
                 3.3.1  Introduction                                 20
                 3.3.2  Buffer as a memory                           20
                 3.3.3  Advantages of the proposed system            21
            3.4  Derivation of the CSDA Algorithm                    21
                 3.4.1  Factor sharing derivation                    21
                 3.4.2  Distributed arithmetic format                21
                 3.4.3  CSDA Algorithm                               22
            3.5  Flow Diagram                                        24
                 3.5.1  Description                                  24
            3.6  Modules                                             25
                 3.6.1  1-D Common Sharing Distributed
                        Arithmetic MST                               25
                 3.6.2  Even part common sharing distributed
                        arithmetic circuit                           26
                 3.6.3  Odd part common sharing distributed
                        arithmetic circuit                           27
                 3.6.4  ECAT                                         28
                 3.6.5  Permutation                                  29
            3.7  2-D CSDA Core Design                                30
                 3.7.1  Mathematical Derivation of Eight-Point
                        and Four-Point Transforms                    30
                 3.7.2  TMEM                                         32

4           SYSTEM IMPLEMENTATION                                    33
            4.1  Xilinx ISE Overview                                 33
                 4.1.1  Design Entry                                 33
                 4.1.2  Synthesis                                    33
                 4.1.3  Implementation                               33
                 4.1.4  Verification                                 34
                 4.1.5  Device Configuration                         34
            4.2  ModelSim Overview                                   34
            4.3  Project Flow                                        36
            4.4  Multiple Library Flow                               37
            4.5  Debugging Tools                                     37
            4.6  Verilog                                             38

5           RESULT ANALYSIS                                          40
            5.1  Comparison With Existing Systems                    40
            5.2  MUX Selection Inputs                                41
            5.3  MPEG Simulation Result                              42
            5.4  H.264 Simulation Result                             43
            5.5  VC-1 Simulation Result                              44
            5.6  RTL Schematic View of Entire Process                45
            5.7  Synthesis Report for Output                         48
            5.8  Power Analyzer Output                               49
            5.9  Device Utilization Summary                          50

            CONCLUSION                                               51
            REFERENCES                                               52

LIST OF FIGURES

FIGURE NO   TITLE                                          PAGE NO

3.1         Flow Diagram of CSDA                           24
3.2         Architecture of the proposed 1-D CSDA-MST      26
3.3         Architecture of the even part CSDA circuit     27
3.4         Architecture of the odd part CSDA circuit      28
3.5         Architecture of ECAT                           29
3.6         Permutation Concept                            29
3.7         2-D CSDA core with TMEM                        30
3.8         TMEM                                           32
5.1         Simulation result for MPEG                     42
5.2         Simulation result for H.264                    43
5.3         Simulation result for VC-1                     44
5.4         RTL view of whole 2-D CSDA architecture        45
5.5         RTL inner view of 2-D CSDA                     46
5.6         RTL inner view of 1-D CSDA                     47
5.7         Output for 2-D CSDA MST delay                  48
5.8         Output for 2-D CSDA MST power                  49
5.9         Output for 2-D CSDA MST gate counts            50

LIST OF TABLES

TABLE NO    TITLE                                          PAGE NO

3.1         Corresponding dimensions of different
            standards of video codecs                      18
5.1         Measured Results                               40
5.2         Selection inputs for different standards       41

LIST OF ABBREVIATIONS

ABBREVIATION    ACRONYM

ASIC            Application Specific Integrated Circuit
CAD             Computer Aided Design
CMOS            Complementary Metal Oxide Semiconductor
CSDA            Common Sharing Distributed Arithmetic
DA              Distributed Arithmetic
ECAT            Error-Compensated Adder-Tree
FS              Factor Sharing
IT              Integer Transform
MPEG            Moving Picture Experts Group
MST             Multistandard Transform
NEDA            New Distributed Arithmetic
RTL             Register Transfer Level
SOC             System on Chip
TMEM            Transpose Memory
VC              Video Codec
VHDL            Very High Speed Integrated Circuit Hardware Description Language

CHAPTER 1
INTRODUCTION
Compression of video and image signals can mainly be done using several transforms, such as the Discrete Cosine Transform, integer transforms, distributed arithmetic and factor sharing. These transforms are mainly used with matrix decomposition methods to reduce the hardware cost as well as the implementation cost, but implementing such transforms can be tedious in some cases, especially when building a single compatible architecture for different standards. In this project, a new technique that supports three video coding standards is implemented.
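The transform at the heart of these schemes can be sketched in a few lines. The following Python fragment is an illustration only (not part of the report's Verilog design): it computes an 8-point DCT-II as a plain matrix-vector product, the operation that hardware transform cores replace with shared-adder networks instead of multipliers.

```python
# Illustrative sketch: an 8-point DCT-II computed as a matrix-vector
# product X = C x, where C is the DCT coefficient matrix.
import math

def dct_matrix(n=8):
    """Build the n-point DCT-II coefficient matrix C."""
    c = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        c.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n)])
    return c

def dct(x):
    """Transform a length-n sample vector x into n frequency coefficients."""
    c = dct_matrix(len(x))
    return [sum(c[k][i] * x[i] for i in range(len(x))) for k in range(len(x))]

# A constant (flat) input has all its energy in the DC (k = 0) coefficient.
coeffs = dct([10.0] * 8)
print(round(coeffs[0], 3))           # DC term: 10 * sqrt(8) ≈ 28.284
print(all(abs(c) < 1e-9 for c in coeffs[1:]))  # AC terms vanish: True
```

In hardware, the constant matrix C is fixed, which is exactly what lets factor-sharing and distributed-arithmetic techniques decompose the multiplications into shared shift-and-add operations.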
1.1 VIDEO COMPRESSION

Video compression uses modern coding techniques to reduce redundancy in video data. Video is nothing but the continuous motion of frames or images obtained from a moving object under consideration. Most video compression algorithms and codecs combine spatial image compression and temporal motion compensation with some encoding features to secure the data. Video compression is a practical implementation of source coding in information theory. In practice, most video codecs also use audio compression techniques in parallel to compress the separate, but combined, data streams as one package. The majority of video compression algorithms use lossy compression, which is the best way to reduce the delay. Uncompressed video requires a very high data rate. Although lossless video compression codecs perform an average compression of about a factor of 3, a typical MPEG-4 lossy compression video has a compression factor between 20 and 200. As in all lossy compression, there is a trade-off between video quality, the cost of processing the compression and decompression, and system requirements. Highly compressed video may present visible or distracting artifacts.
Some video compression schemes typically operate on square-shaped groups of neighboring pixels, often called macroblocks. These pixel groups, or blocks of pixels, are compared from one frame to the next, and the video compression codec sends only the differences within those blocks. In areas of video with more motion, the compression must encode more data to keep up with the larger number of pixels that are changing. Commonly, during explosions, flames, flocks of animals and in some panning shots, the high-frequency detail leads to quality decreases or to increases in the variable bitrate.
1.2 LOSSY COMPRESSION

In information technology, "lossy" compression is the class of data encoding methods that uses inexact approximations (or partial data discarding) to represent the content that has been encoded. Such compression techniques are used to reduce the amount of data that would otherwise be needed to store, handle, and/or transmit the represented content. Successive lossy approximations of an image become progressively coarser as more details of the data that made up the original image are removed. The amount of data reduction possible using lossy compression can often be much more substantial than what is possible with lossless data compression techniques.
Using well-designed lossy compression technology, a substantial amount of data reduction is often possible before the result is sufficiently degraded to be noticed by the user. Even when the degree of degradation becomes noticeable, further data reduction may often be desirable for some applications (e.g., to make real-time communication possible through a limited bit-rate channel, to reduce the time needed to transmit the content, or to reduce the necessary storage capacity).
Lossy compression is most commonly used to compress multimedia data (audio, video, and still images), especially in applications such as streaming media and internet telephony. By contrast, lossless compression is typically required for text and data files, such as bank records and text articles. In many cases it is advantageous to make a master lossless file that can then be used to produce compressed files for different purposes; for example, a multi-megabyte file can be used at full size to produce a full-page advertisement in a glossy magazine, and a 10-kilobyte lossy copy can be made for a small image on a web page.
1.3 ADVANTAGES OF VIDEO COMPRESSION

The main advantage of compression is that it reduces the data storage requirements. It also offers an attractive approach to reducing the communication cost of transmitting high volumes of data over long-haul links, via higher effective utilization of the available bandwidth in the data links. This significantly aids in reducing the cost of communication due to the data rate reduction. Because of the data rate reduction, data compression also increases the quality of multimedia presentation through limited-bandwidth communication channels. Hence the audience can experience rich-quality signals for audio-visual data representation. For example, because of sophisticated compression technologies we can receive toll-quality audio from the other side of the globe through the good old telecommunications channels at a much better price than a decade ago. Because of the significant progress in image compression techniques, a single 6 MHz broadcast television channel can carry HDTV signals to provide better quality audio and video at much higher rates and enhanced resolution without additional bandwidth requirements. The rate of input-output operations in a computing device can be greatly increased due to the shorter representation of data.
In systems with levels of storage hierarchy, data compression in principle makes it possible to store data at a higher and faster storage level (usually with smaller capacity), thereby reducing the load on the input-output channels. Data compression obviously reduces the cost of backup and recovery of data in computer systems, by storing the backup of large database files in compressed form. The advantages of data compression will enable more multimedia applications at reduced cost, and hence aid its usage by a larger population with newer applications in the near future.
1.4 DEVELOPMENTS IN THE FIELD OF VLSI

There are a number of directions a person can take in VLSI, and they are all closely related to each other. Together, these developments are going to make possible the visions of embedded systems and ubiquitous computing.
1.4.1 Reconfigurable computing

Reconfigurable computing is a very interesting and fairly recent development in microelectronics. It involves fabricating circuits that can be reprogrammed on the fly! And no, we are not talking about microcontrollers running with EEPROM inside. Reconfigurable computing involves specially fabricated devices called FPGAs that, when programmed, act just like normal electronic circuits. They are designed so that by changing or "reprogramming" the connections between numerous sub-modules, the FPGAs can be made to behave like any circuit we wish.
This fantastic ability to create modifiable circuits again opens up new possibilities in microelectronics. Consider, for example, microprocessors which are partly reconfigurable. We know that running complex programs can benefit greatly if support is built into the hardware itself. We could have a microprocessor that could optimise itself for every task that it tackled! Or consider a system that is too big to implement on hardware that may be limited by cost or other constraints. If we use a reconfigurable platform, we could design the system so that parts of it are mapped onto the same hardware at different times. One could think of many such applications, not the least of which is prototyping - using an FPGA to try out a new design before it is actually fabricated. This can drastically reduce development cycles, and also save some money that would have been spent in fabricating prototype ICs.
1.4.2 Takeover of hardware design

ASICs provided the path to creating miniature devices that can perform a lot of diverse functions. But with the impending boom in this kind of technology, what we need is a large number of people who can design these ICs. This is where we realise that we cross the threshold between a chip designer and a systems designer at a higher level. Does a person designing a chip really need to know every minute detail of the IC manufacturing process? Can there be tools that allow a designer to simply create design specifications that get translated into hardware?
The solution to this is rather simple - hardware compilers, or silicon compilers as they are called. We know by now that there exist languages like VHDL which can be used to specify the design of a chip. What if we had a compiler that converts a high-level language into a VHDL specification? The potential of this technology is tremendous - in a simple manner, we can convert all the software programmers into hardware designers.
1.4.3 The need for hardware compilers

Before we go further, let us look at why we need this kind of technology that can convert high-level languages into hardware definitions. We see a set of needs which actually lead from one to the other in a series.

A. Rapid development cycles
The traditional method of designing hardware is a long and winding process, going through many stages with special effort spent on design verification at every stage. This means that the time from drawing board to market is very long. This proves to be rather undesirable in the case of a large expanding market with many competitors trying to grab a share. We need alternatives to cut down on this time so that new ideas reach the market faster, where the first person to get in normally gains a large advantage.
B. Large number of designers
With embedded systems becoming more and more popular, there is a need for a large number of chip designers who can churn out chips designed for specific applications. It is impractical to think of training so many people in the intricacies of VLSI design.

C. Specialized training
A person who wishes to design ASICs will require extensive training in the field of VLSI design. But we cannot possibly expect to find a large number of people who would wish to undergo such training. Also, the process of training these people will itself entail large investments in time and money. This means there has to be a system which can abstract out all the details of VLSI, and which allows the user to think in simple system-level terms.
There are quite a few tools available for using high-level languages in circuit design, but this area has started showing fruits only recently. For example, there is a language called Handel-C that looks just like good old C, but has some special extensions that make it usable for defining circuits. A program written in Handel-C can be represented block-by-block by hardware equivalents. And in doing all this, the compiler takes care of all low-level issues like clock frequency, layout, etc. The biggest selling point is that the user does not really have to learn anything new, except for the few extensions made to C so that it may be conveniently used for circuit design.
Another quite different language, still under development, is Lava. This is based on an esoteric branch of computer science called "functional programming". FP itself is pretty old, and is radically different from the normal way we write programs, because it assumes parallel execution as part of its structure - it is not based on the normal idea of a "sequence of instructions". This parallel nature is something very suitable for hardware, since logic circuits are inherently parallel in nature. Preliminary studies have shown that Lava can actually create better circuits than VHDL itself, since it affords a high-level view of the system without losing sight of low-level features.

1.5 DESIGN METHODOLOGY

A good VLSI design system should provide consistent descriptions in all three description domains (behavioral, structural, and physical) and at all levels of abstraction (e.g. architecture, RTL/block, logic, circuit). The means by which this is measured are various parameters that differ in importance based on the application. These parameters can be summarized as:
Performance - speed, power, flexibility
Size of die (cost of die)
Time to design (cost of engineering and schedule)
Ease of verification, test generation and testability (cost of engineering and schedule)
Design is a continuous trade-off to achieve adequate results for all of the above parameters, so the tools and methodologies used for a particular chip are chosen on the basis of these parameters. Other constraints depend on economics (e.g., the size of the die affecting yield) or are even subjective.
The process of designing a system on silicon is complicated, and the role of good VLSI design aids is to reduce this complexity, increase productivity, and assure the designer of a working product. A good method simplifies the approach to a design through the use of constraints and abstraction, in contrast to the design flow used to build a chip. The basic design methods are arranged roughly in order of increasing investment, which loosely relates to the time and cost it takes to design and implement the system. It is important to understand the cost, capabilities and limitations of a given implementation technology in order to select the right solution; it makes little sense to design a custom chip when an off-the-shelf solution that meets the system criteria is available at the same or lower cost.
1.6 OBJECTIVE AND SCOPE

This project deals with an MST core that supports the H.264 (8×8, 4×4), MPEG-1/2/4 (8×8) and VC-1 (8×8, 8×4, 4×8, 4×4) transforms. The proposed MST core employs distributed arithmetic and factor sharing schemes, combined as common sharing distributed arithmetic (CSDA), to reduce hardware cost.
Our new design of the multistandard transform video codec architecture achieves high throughput, low area and low delay.
1.7 APPLICATIONS

Digital video codecs are found in DVD systems (players, recorders), Video CD systems, emerging satellite and digital terrestrial broadcast systems, and various digital devices and software products with video recording or playing capability. Online video material is encoded by a variety of codecs, and this has led to the availability of codec packs - pre-assembled sets of commonly used codecs combined with an installer, available as software packages for PCs, such as the K-Lite Codec Pack. Encoding of media by the public has seen an upsurge with the availability of CD and DVD recorders.
1.8 RECENT RESEARCH IN VIDEO COMPRESSION

Although the imminent death of research into video compression has often been proclaimed, the growth in capacity of telecommunications networks is being outpaced by the rapidly increasing demand for services. The result is an ongoing need for better multimedia compression, and particularly video and image compression.
At the same time, there is a need for these services to be carried on networks of greatly varying capacities and qualities of service, and to be decoded by devices ranging from small, low-power, handheld terminals to much more capable fixed systems. Hence, the ideal video compression algorithm would have high compression efficiency, be scalable to accommodate variations in network performance including capacity and quality of service, and be scalable to accommodate variations in decoder capability. These issues are examined here, illustrated by recent research at UNSW@ADFA in compression efficiency, scalability and error resilience.

CHAPTER 2
VIDEO CODECS
2.1 VIDEO CODEC DESIGN

A video codec is a device or software that enables compression or decompression of digital video; the format of the compressed data adheres to a video compression specification. The compression is usually lossy. Historically, video was stored as an analog signal on magnetic tape. Around the time when the compact disc entered the market as a digital-format replacement for analog audio, it became feasible to also begin storing and using video in digital form, and a variety of such technologies began to emerge. Audio and video call for customized methods of compression, which may lead to new trends in telecommunication and wireless systems. Engineers and mathematicians have tried a number of solutions for tackling this problem.
There is a complex relationship between the video quality, the quantity of data needed to represent it (also known as the bit rate), the complexity of the encoding and decoding algorithms, robustness to data losses and errors, ease of editing, random access, and end-to-end delay. Video codecs seek to represent a fundamentally analog data set in a digital format. Because of the design of analog video signals, which represent luma and color information separately, a common first step in image compression in codec design is to represent and store the image in a Y,Cb,Cr color space. The conversion to Y,Cb,Cr provides two benefits: first, it improves compressibility by providing decorrelation of the color signals; and second, it separates the luma signal, which is perceptually much more important, from the chroma signal, which is less perceptually important and which can be represented at lower resolution to achieve more efficient data compression. The ratios of information stored in these different channels are commonly written as Y:Cb:Cr, a notation known as chroma subsampling.
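As a concrete sketch of the color-space step just described, the following Python fragment applies the BT.601 full-range RGB-to-Y,Cb,Cr matrix. The exact coefficients and value ranges vary between codecs; these particular numbers are an illustrative assumption, not a detail taken from this report.

```python
# Illustrative sketch of the RGB -> Y,Cb,Cr conversion (BT.601 full range).
# Y carries perceptual brightness; Cb and Cr carry color differences that
# codecs can store at reduced resolution.

def rgb_to_ycbcr(r, g, b):
    """Map 8-bit R, G, B samples to Y, Cb, Cr (all in 0..255)."""
    y  =        0.299    * r + 0.587    * g + 0.114 * b  # luma
    cb = 128 -  0.168736 * r - 0.331264 * g + 0.5   * b  # blue-difference chroma
    cr = 128 +  0.5      * r - 0.418688 * g - 0.081312 * b  # red-difference chroma
    return round(y), round(cb), round(cr)

# Pure gray carries no color information: both chroma channels sit at the
# neutral value 128, so only the luma channel varies.
print(rgb_to_ycbcr(128, 128, 128))  # -> (128, 128, 128)
```

The decorrelation is visible in the gray example: an RGB gray ramp needs all three channels to change together, while in Y,Cb,Cr only Y changes, which is exactly what makes the representation easier to compress.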
Different codecs will use different chroma subsampling ratios as appropriate to their compression needs. Video compression schemes for the Web and DVD make use of a 4:2:0 color sampling pattern, and the DV standard uses 4:1:1 sampling ratios. Professional video codecs designed to function at much higher bitrates, and to record a greater amount of color information for post-production manipulation, sample in 3:1:1 (uncommon), 4:2:2 and 4:4:4 ratios. Examples of these codecs include Panasonic's DVCPRO50 and DVCPROHD codecs (4:2:2), Sony's HDCAM-SR (4:4:4) and Panasonic's D5 HD (4:2:2). Apple's ProRes 422 HQ codec also samples in the 4:2:2 color space. More codecs that sample in 4:4:4 patterns exist as well, but are less common, and tend to be used internally in post-production houses. It is also worth noting that video codecs can operate in RGB space as well. These codecs tend not to sample the red, green and blue channels in different ratios, since there is less perceptual motivation for doing so; at most, the blue channel could be subsampled.
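A quick back-of-the-envelope calculation (an illustration, not from the report) shows why 4:2:0 subsampling is attractive: relative to 4:4:4, it keeps every luma sample but only a quarter of each chroma plane, halving the raw frame size before any transform coding at all.

```python
# Raw bytes per frame for the chroma subsampling schemes mentioned above,
# assuming 8 bits per sample.  The ratio expresses how many chroma samples
# (Cb + Cr together) are stored per luma sample: 4:4:4 keeps 2 per luma,
# 4:2:2 keeps 1, and 4:2:0 keeps 0.5.

def bytes_per_frame(width, height, scheme):
    chroma_per_luma = {"4:4:4": 2.0, "4:2:2": 1.0, "4:2:0": 0.5}[scheme]
    return int(width * height * (1 + chroma_per_luma))

full = bytes_per_frame(720, 576, "4:4:4")   # full-color PAL-size frame
dvd  = bytes_per_frame(720, 576, "4:2:0")   # the DVD/Web sampling pattern
print(full, dvd, full / dvd)                # 4:2:0 halves the raw frame size
```
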
Some amount of spatial and temporal downsampling may also be used to reduce the raw data rate before the basic encoding process. The most popular transform is the 8×8 discrete cosine transform (DCT). Codecs which make use of a wavelet transform are also entering the market, especially in camera workflows which involve dealing with RAW image formatting in motion sequences. The output of the transform is first quantized, then entropy encoding is applied to the quantized values. When a DCT has been used, the coefficients are typically scanned using a zig-zag scan order, and the entropy coding typically combines a number of consecutive zero-valued quantized coefficients with the value of the next non-zero quantized coefficient into a single symbol, and also has special ways of indicating when all of the remaining quantized coefficient values are equal to zero. The entropy coding method typically uses variable-length coding tables. Some encoders can compress the video in a multiple-step process called n-pass encoding (e.g. 2-pass), which performs a slower but potentially better quality compression.
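The zig-zag scan mentioned above can be sketched directly. This illustrative Python fragment (not part of the report's design) generates the scan order for an n×n block and shows how a typical sparse quantized block, with only low-frequency coefficients surviving, ends in a long run of zeros that the entropy coder can collapse.

```python
# Sketch of the zig-zag scan: traverse an n x n coefficient block along
# anti-diagonals, alternating direction, so coefficients come out roughly
# ordered from low to high spatial frequency.

def zigzag_order(n=8):
    """Return (row, col) pairs in zig-zag order for an n x n block."""
    # Cells share an anti-diagonal when row + col is equal; odd diagonals
    # are walked downward (row ascending), even ones upward (col ascending).
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def scan(block):
    """Flatten a square block into a 1-D list in zig-zag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]

# After quantization only the top-left (low-frequency) corner is nonzero...
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0] = 52, -3, 6
print(scan(block)[:4])        # low-frequency values come first: [52, -3, 6, 0]
print(any(scan(block)[4:]))   # ...and the rest is one run of zeros: False
```
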
The decoding process consists of performing, to the extent possible, an inversion of each stage of the encoding process. The one stage that cannot be exactly inverted is the quantization stage. There, a best-effort approximation of inversion is performed. This part of the process is often called "inverse quantization" or "dequantization", although quantization is an inherently non-invertible process. The decoding process also involves reassembling the video image from its set of macroblocks, a critical facet of video codec design.
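The non-invertibility of quantization is easy to demonstrate. The following minimal Python sketch (an illustration, not the report's design) quantizes with a single step size and shows distinct inputs collapsing to the same reconstructed value, which is exactly the information loss the decoder cannot undo.

```python
# Why "inverse quantization" is only a best-effort approximation:
# the encoder divides by the step size and rounds (lossy), and the
# decoder can only multiply the integer level back up.

def quantize(coeff, step):
    return round(coeff / step)    # nearby coefficients collapse to one level

def dequantize(level, step):
    return level * step           # best-effort reconstruction

step = 16
for original in (100, 104, 95):
    level = quantize(original, step)
    print(original, "->", dequantize(level, step))
# All three inputs quantize to level 6 and reconstruct to 96;
# the original values cannot be recovered.
```

Larger step sizes give coarser levels and higher compression at the cost of larger reconstruction error, which is the basic quality/bit-rate dial of every lossy codec.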
Video codec designs are often standardized, or will be in the future - i.e., specified precisely in a published document. However, only the decoding process needs to be standardized to enable interoperability. The encoding process is typically not specified at all in a standard, and implementers are free to design their encoder however they want, as long as the video can be decoded in the specified manner. For this reason, the quality of the video produced by decoding the results of different encoders that use the same video codec standard can vary dramatically from one encoder implementation to another.

2.2 DIFFERENT STANDARDS

In this project, three different standards are considered:
MPEG 1/2/4
H.264
VC-1
2.2.1 MPEG 1/2/4

The MPEG standards consist of different Parts. Each part covers a certain aspect of the whole specification. The standards also specify Profiles and Levels. Profiles are intended to define a set of tools that are available, and Levels define the range of appropriate values for the properties associated with them. Some of the approved MPEG standards were revised by later amendments and/or new editions. MPEG has standardized the following compression formats and ancillary standards:
2.2.1.1 MPEG-1

Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s (ISO/IEC 11172). The first MPEG compression standard for audio and video. It is commonly limited to about 1.5 Mbit/s, although the specification is capable of much higher bit rates. It was basically designed to allow moving pictures and sound to be encoded into the bitrate of a Compact Disc. It is used on Video CD and can be used for low-quality video on DVD Video. It was used in digital satellite/cable TV services before MPEG-2 became widespread. To meet the low bit-rate requirement, MPEG-1 downsamples the images and uses picture rates of only 24-30 Hz, resulting in a moderate quality. It includes the popular MPEG-1 Audio Layer III (MP3) audio compression format.
2.2.1.2 MPEG-2

Generic coding of moving pictures and associated audio information (ISO/IEC 13818). Transport, video and audio standards for broadcast-quality television. The MPEG-2 standard was considerably broader in scope and of wider appeal, supporting interlacing and high definition. MPEG-2 is considered important because it has been chosen as the compression scheme for over-the-air digital television (ATSC, DVB and ISDB), digital satellite TV services like Dish Network, digital cable television signals, SVCD and DVD Video. It is also used on Blu-ray Discs, but these normally use MPEG-4 Part 10 or SMPTE VC-1 for high-definition content.
2.2.1.3 MPEG-4

Coding of audio-visual objects (ISO/IEC 14496). MPEG-4 uses further coding tools with additional complexity to achieve higher compression factors than MPEG-2. In addition to more efficient coding of video, MPEG-4 moves closer to computer graphics applications. In more complex profiles, the MPEG-4 decoder effectively becomes a rendering processor, and the compressed bit stream describes three-dimensional shapes and surface texture. MPEG-4 supports Intellectual Property Management and Protection (IPMP), which provides the facility to use proprietary technologies to manage and protect content, like digital rights management. It also supports MPEG-J, a fully programmatic solution for the creation of custom interactive multimedia applications (a Java application environment with a Java API), and many other features.

2.2.2 H.264
H.264 or MPEG-4 Part 10, Advanced Video Coding (MPEG-4
AVC) is a video compression format that is currently one of the most
commonly used formats for the recording, compression, and distribution
of video content. H.264/MPEG-4 AVC is a block-oriented motioncompensation-based video compression standard developed by the ITUT Video Coding Experts Group (VCEG) together with the ISO/IEC
JTC1 Moving Picture Experts Group (MPEG).
H.264 is perhaps best known as one of the video encoding standards for Blu-ray Discs; all Blu-ray Disc players must be able to decode H.264. It is also widely used by streaming internet sources, such as videos from Vimeo, YouTube, and the iTunes Store, and by web software such as the Adobe Flash Player and Microsoft Silverlight. H.264 is typically used for lossy compression in the strict mathematical sense, although the amount of loss may sometimes be imperceptible. It is also possible to create truly lossless encodings with it, e.g., to have localized lossless-coded regions within lossy-coded pictures or to support rare use cases for which the entire encoding is lossless.
The intent of the H.264/AVC project was to create a standard
capable of providing good video quality at substantially lower bit rates
than previous standards (i.e., half or less the bit rate of MPEG-2, H.263,
or MPEG-4 Part 2), without increasing the complexity of design so much
that it would be impractical or excessively expensive to implement. An
additional goal was to provide enough flexibility to allow the standard to
be applied to a wide variety of applications on a wide variety of networks
and systems, including low and high bit rates, low and high resolution
video, broadcast, DVD storage, RTP/IP packet networks, and ITU-T multimedia telephony systems.
The H.264 standard can be viewed as a "family of standards"


composed of the profiles described below. A specific decoder decodes at
least one, but not necessarily all profiles. The decoder specification
describes which profiles can be decoded. The H.264 name follows
the ITU-T naming convention, where the standard is a member of the
H.26x line of VCEG video coding standards.
2.2.3 VC-1
VC-1 is an evolution of the conventional DCT-based video codec
design also found in H.261, MPEG-1 Part 2, H.262/MPEG-2 Part
2, H.263, and MPEG-4 Part 2. It is widely characterized as an alternative
to the ITU-T and MPEG video codec standard known as H.264/MPEG-4
AVC. VC-1 contains coding tools for interlaced video sequences as well
as progressive encoding. The main goal of VC-1 Advanced Profile
development and standardization was to support the compression of
interlaced content without first converting it to progressive, making it
more attractive to broadcast and video industry professionals.
Both HD DVD and Blu-ray Disc have adopted VC-1 as a video
standard, meaning their video playback devices will be capable of
decoding and playing video-content compressed using VC-1. Windows
Vista partially supports HD DVD playback by including the VC-1
decoder and some related components needed for playback of VC-1
encoded HD DVD movies.


CHAPTER 3
SYSTEM ANALYSIS
3.1 PROJECT INTRODUCTION
Compression can mainly be done using several transforms, such as the Discrete Cosine Transform (DCT), integer transforms, distributed arithmetic (DA), and factor sharing (FS), in video and image signals. These transforms are mainly used with matrix decomposition methods to reduce the hardware cost as well as the implementation cost. Swartzlander and Yu present an efficient method for reducing ROM size by using recursive DCT algorithms. ROMs scale well, but not as well as some other circuits with shrinking technology nodes, so numerous ROM-free DA architectures have emerged recently. A new DA sharing system called NEDA involves a bit-level sharing scheme to implement the butterfly matrix based on adders. These are used to support any one of the application standards (Table 3.1).
Table 3.1 Corresponding Dimensions of Different Video Codecs

Video Codecs      Dimensions                  Groups
MPEG-1/2/4        8x8                         ISO
H.264             8x8, 4x4                    ITU-T
VC-1              8x8, 8x4, 4x8, 4x4          Microsoft

The DFT is the most important discrete transform, used to perform Fourier analysis in many practical applications. In digital signal processing, the function is any quantity or signal that varies over time, such as the pressure of a sound wave, a radio signal, or daily temperature readings, sampled over a finite time interval (often defined by a window function). In image processing, the samples can be the values of pixels along a row or column of a raster image. The DFT is also used to efficiently solve partial differential equations and to perform other operations such as convolutions or multiplying large integers. Likewise, the DCT and the other transforms above have the advantage of increasing the throughput rate.
3.2 EXISTING SYSTEM
3.2.1 INTRODUCTION
Numerous researchers have worked on transform core designs, including the discrete cosine transform (DCT) and integer transforms, using distributed arithmetic (DA), factor sharing (FS), and matrix decomposition methods to reduce hardware cost. The inner product can be implemented using ROMs and accumulators instead of multipliers to reduce the area cost. To improve the throughput rate of the NEDA method, high-throughput adder trees are introduced. The FS method derives the matrices for multiple standards as linear combinations of the same matrix and a delta matrix, and shows that the coefficients in the same matrix can share the same hardware resources. Matrices for VC-1 transformations can be decomposed into several small matrices. Recently, reconfigurable architectures have been presented as a solution to achieve good flexibility of processors on field-programmable gate array (FPGA) platforms or application-specific integrated circuits (ASICs). These existing methods fully support the transform core for the H.264 standard, including the 8x8 and 4x4 transforms, but the eight-point and four-point transform cores for MPEG-1/2/4 and H.264 cannot support the VC-1 compression standard. To overcome this limitation, the proposed system is presented.
3.2.2 CSDA
CSDA stands for Common Sharing Distributed Arithmetic; it is a technique that combines factor sharing and distributed arithmetic to generate the CSDA coefficients. Factor sharing means sharing the same factors among the existing coefficients, and distributed arithmetic means sharing the same input combinations. In the existing system, a pipeline register is used as the storage element.
3.2.3 LIMITATIONS OF EXISTING SYSTEM
Low throughput
High cost
High delay
More number of adders
3.3 PROPOSED SYSTEM

3.3.1 INTRODUCTION
The proposed CSDA combines the DA and FS methods by expanding the coefficient matrix at the bit level. The factor sharing method first shares the same factor in each coefficient; the distributed arithmetic method is then applied to share the same combination of inputs among the coefficient positions.
3.3.2 BUFFER AS A MEMORY
In the proposed system, a buffer is used as the memory element instead of a pipeline register. The buffer is active only when the clock input is high. Using a buffer here lets the bit stream proceed without any halt in memory, so the delay is considerably reduced. Because no data is held in a register, the retrieval time is very small.
3.3.3 ADVANTAGES OF PROPOSED SYSTEM
High throughput
Low cost
Supports three different types of video codecs
Reduction in number of adders
3.4 DERIVATION OF CSDA ALGORITHM
The CSDA algorithm mainly combines the factor sharing and distributed arithmetic techniques. The methods for computing the coefficients are given below.
3.4.1 Factor sharing derivation
In this technique, signals having the same factors share that factor. If the signals S1 and S2 are the products of the coefficients C1 and C2 with an input X, and C1 and C2 contain a common factor Fs, they can be written as

S1 = C1·X = Fd1·(Fs·X),  S2 = C2·X = Fd2·(Fs·X)        (3.1)

where Fs (the shared factor) and Fd1, Fd2 (the remainder coefficients) can be found in the coefficients C1 and C2, respectively. The partial product Fs·X is computed once and shared between S1 and S2.
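As a minimal sketch of the factor sharing idea (the coefficient values below are illustrative, not the report's actual transform coefficients), the shared partial product Fs·X is computed once and reused:

```python
# Factor-sharing sketch: C1 = 12 and C2 = 20 share the factor Fs = 4,
# so Fs*x is computed once and reused for both products.
def fs_products(x, fs=4, fd1=3, fd2=5):
    shared = fs * x          # shared partial product (computed once)
    s1 = fd1 * shared        # S1 = C1*X = Fd1*(Fs*X)
    s2 = fd2 * shared        # S2 = C2*X = Fd2*(Fs*X)
    return s1, s2

s1, s2 = fs_products(7)
assert s1 == 12 * 7 and s2 == 20 * 7   # same results as direct products
```

In hardware, the shared term corresponds to one set of shifters/adders feeding both outputs instead of two independent multipliers.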
3.4.2 Distributed Arithmetic format
For matrix multiplication and accumulation, the inner product can be written as

Y = sum_{i=1..M} Ai·Xi        (3.2)

where Ai is an N-bit CSD coefficient and Xi is the input data. Expanding each Ai into its signed digits a_{i,j} in {-1, 0, 1} with weight 2^{-j} gives

Y = sum_{j=0..N-1} 2^{-j}·( sum_{i=1..M} a_{i,j}·Xi ) = sum_{j=0..N-1} Yj·2^{-j}        (3.3)

The product Y can be obtained by shifting and adding every nonzero Yj. The inner product can thus be implemented with shifters and adders instead of multipliers, for a low-cost implementation.
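The shift-and-add evaluation of the DA inner product can be sketched as follows (illustrative coefficients; integer weights 2^j are used here instead of the fractional 2^-j weights for readability):

```python
# Distributed-arithmetic sketch: the inner product Y = sum(Ai*Xi) is
# evaluated bit-plane by bit-plane using signed digits a[i][j] in
# {-1, 0, 1}, each plane sharing one adder stage.
def da_inner_product(digits, x):
    # digits[i][j] is the weight-2^j signed digit of coefficient Ai (LSB first)
    nbits = len(digits[0])
    y = 0
    for j in range(nbits):
        yj = sum(digits[i][j] * x[i] for i in range(len(x)))  # shared adder plane
        y += yj << j                                          # shift-and-add
    return y

# A1 = 5 (digits 1,0,1) and A2 = 3 (digits 1,1,0), inputs X = [2, 4]
digits = [[1, 0, 1], [1, 1, 0]]
x = [2, 4]
assert da_inner_product(digits, x) == 5 * 2 + 3 * 4   # 22
```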
3.4.3 CSDA Algorithm
The inner product can be written as the product of the input vector and the coefficient matrix, as in (3.4) and (3.5).
This section provides a discussion of the hardware resources and


system accuracy for the proposed 2-D CSDA-MST core and also presents
a comparison with previous works. Finally, the characteristics of the
implementation into a chip are described.

The coefficients can be generated as in the above matrix; the values are compared, and the same factors, i.e., [1, -1], are shared to obtain the value Fs. From the shared factor, the distributed arithmetic values are then considered with the help of the inputs Xi. The CSDA combines the factor sharing and DA methods. The FS method is applied first to identify the factors that achieve the greater hardware resource sharing capacity. The shared factor Fs in the four coefficients is [1, -1], and C1 and C2 can use [1, -1] at the corresponding positions under the FS method. The distributed arithmetic is then applied to share the same input positions, giving the DA shared coefficient DA1 = (X1 + X2)·Fs. Finally, the matrix inner product in the above equation can be implemented by shifting and adding every nonzero weight position.
To realize the searching flow, software code performs iterative searching loops under a constraint on the minimum number of nonzero elements. Because the choice of shared coefficients is obtained under these constraints, the resulting coefficients are not a global optimal solution with the minimal number of nonzero bits.
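Combining the two steps, a toy CSDA evaluation can be sketched as below (illustrative values: both coefficients equal 6 = Fs·Fd with Fs = 3 and Fd = 2, so the DA step shares the input sum and the FS step applies the common factor once):

```python
# CSDA sketch: S = C1*X1 + C2*X2 with C1 = C2 = 6 collapses to
# Fd * (Fs * (X1 + X2)) -- one adder plus shift-adds instead of two
# multipliers.
def csda_sum(x1, x2, fs=3, fd=2):
    da1 = x1 + x2            # DA: shared input combination
    return fd * (fs * da1)   # FS: shared factor applied once

assert csda_sum(4, 5) == 6 * 4 + 6 * 5   # 54, same as the direct sum
```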
3.5 FLOW DIAGRAM

Input coefficient matrix
-> FS finds a new shared factor in the coefficient matrix
-> DA finds the shared coefficient based on the FS results
-> Calculate the number of adders
-> Compare with the previous adder count, and update the smallest one for FS and DA
-> (iteration searching loop: repeat from the FS step)
-> Find the CSDA shared coefficient

Fig 3.1 CSDA flow diagram


3.5.1 DESCRIPTION:
To obtain better resource sharing for the inner product operation, the proposed CSDA combines the FS and DA methods. The FS method is adopted first to identify the factors that can achieve higher capability in hardware resource sharing, where the hardware resource in this paper is defined as the number of adders used. Next, the DA method is used to find the shared coefficient based on the results of the FS method. The adder-tree circuits follow the proposed CSDA circuit. Thus, the CSDA method aims to reduce the nonzero elements to as few as possible. The CSDA shared coefficient is used for estimating and then comparing the number of adders in the CSDA loop. Therefore, the iteration searching loop requires a large number of loops to determine the smallest hardware resource by these steps, after which the CSDA shared coefficient can be established. Notice that the optimal factor or coefficient under the FS or DA method alone does not yield the smallest resource in the proposed CSDA method. Thus, a number of iteration loops is needed to determine the better CSDA shared coefficient.
3.6 MODULES

3.6.1 1-D Common Sharing Distributed Arithmetic MST

Based on the proposed CSDA algorithm, the coefficients for the MPEG-1/2/4, H.264, and VC-1 transforms are chosen to achieve high sharing capability for arithmetic resources. To adopt the searching flow, software code performs the iterative searching loop under a constraint on the minimum number of nonzero elements; in this paper, this constraint is set to five. After software searching, the coefficients are obtained in CSD expression, where a negative digit (written with an overbar) indicates -1. Note that this choice of shared coefficient is obtained under constraints; thus, the chosen CSDA coefficient is not a global optimal solution, only a local or suboptimal one. Besides, the resulting CSD codes are not the optimal expression with the minimal number of nonzero bits.
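The CSD expression mentioned above can be sketched with a standard canonical-signed-digit encoder (a generic textbook routine, not the report's search software): each coefficient is rewritten with digits in {-1, 0, 1} so that no two adjacent digits are nonzero.

```python
# CSD (canonical signed digit) encoding sketch, LSB first.
def to_csd(n):
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 if n mod 4 == 1, else -1
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

csd = to_csd(7)                # 7 = 8 - 1 -> digits [-1, 0, 0, 1]
assert csd == [-1, 0, 0, 1]
assert sum(d << i for i, d in enumerate(csd)) == 7   # digits reconstruct 7
```

Fewer nonzero digits mean fewer shift-and-add terms in the adder trees, which is why the search constrains the nonzero-element count.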


Fig 3.2. Architecture of the proposed 1-D CSDA-MST.


3.6.2 Even part common sharing distributed arithmetic circuit
The SBF module executes the eight-point transform and bypasses the input data for two four-point transforms. After the SBF module, CSDA_E and CSDA_O execute by feeding input data a and b, respectively. CSDA_E calculates the even part of the eight-point transform, which is similar to the four-point transform for the H.264 and VC-1 standards. Within the CSDA_E architecture, two pipeline stages exist (12-bit and 13-bit). The first stage executes as a four-input butterfly matrix circuit, and the second stage of CSDA_E then uses the proposed CSDA algorithm to share hardware resources across the variable standards.


Fig.3.3 Architecture of the even part CSDA circuit (1st stage, memory, 2nd stage, memory)


3.6.3 Odd part common sharing distributed arithmetic circuit:
Similar to the CSDA_E, the CSDA_O also has two pipeline stages. Based on the proposed CSDA algorithm, the CSDA_O efficiently shares the hardware resources among the odd part of the eight-point transform and the four-point transform for the variable standards. It contains selection signals of multiplexers (MUXs) for the different standards. Eight adder trees with error compensation (ECATs) follow the CSDA_E and CSDA_O; they add the nonzero CSDA coefficients with their corresponding weights in tree-like architectures. The ECAT circuits can alleviate truncation error efficiently in a small-area design when summing all the nonzero data together.


Fig.3.4. Architecture of the odd part CSDA circuit (1st stage, memory, 2nd stage, memory)


3.6.4 ECAT
Eight adder trees with error compensation (ECATs) follow the CSDA_E and CSDA_O; they add the nonzero CSDA coefficients with their corresponding weights in tree-like architectures. The ECAT circuits can alleviate truncation error efficiently in a small-area design when summing all the nonzero data together.


Fig.3.5. Architecture of ECAT


3.6.5 Permutation
The eight outputs from the ECATs are given directly to the permutation stage. Permutation relates to the act of rearranging, or permuting, all the members of a set into some sequence or order (unlike combinations, which are selections of some members of the set where order is disregarded). It is used to encode the output matrix.

Fig.3.6 Permutation concept
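The permutation concept amounts to a fixed index remapping of the ECAT outputs; a tiny sketch (the index map below is illustrative, not the core's actual wiring) is:

```python
# Permutation sketch: reorder the eight ECAT outputs into the positions
# required by the output matrix (hypothetical ordering).
order = [0, 4, 2, 6, 1, 3, 5, 7]
ecat_out = [10, 11, 12, 13, 14, 15, 16, 17]
permuted = [ecat_out[i] for i in order]
assert sorted(permuted) == ecat_out   # a permutation only reorders values
```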


3.7 2D CSDA CORE DESIGN

Fig.3.7 2D CSDA core with TMEM


3.7.1 Mathematical Derivation of Eight-Point and Four-Point Transforms
This section introduces the proposed 2-D CSDA-MST core implementation. Neglecting the scaling factor, the one-dimensional (1-D) eight-point transform can be defined as

Z = C8·x        (3.6)

where x = [x0, x1, ..., x7]^T is the input vector and C8 is the 8x8 coefficient matrix, and the corresponding 2-D transform is

Z = C8·X·C8^T        (3.7)

Because the eight-point coefficient structures in the MPEG-1/2/4, H.264, and VC-1 standards are the same, the eight-point transform for these standards can use the same mathematical derivation. According to the symmetry property, the 1-D eight-point transform can be divided into even and odd four-point transforms, Ze and Zo:

Ze = Ce·[x0+x7, x1+x6, x2+x5, x3+x4]^T,  Zo = Co·[x0-x7, x1-x6, x2-x5, x3-x4]^T        (3.8)

The even part of the operation in (3.8) is the same as that of the four-point H.264 and VC-1 transformations. Moreover, the even part Ze can be further decomposed into even and odd parts, Zee and Zeo:

Zee = Cee·[(x0+x7)+(x3+x4), (x1+x6)+(x2+x5)]^T,  Zeo = Ceo·[(x0+x7)-(x3+x4), (x1+x6)-(x2+x5)]^T        (3.9)
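The even/odd split above can be checked numerically; the sketch below uses an unscaled DCT-II basis as a stand-in for the standards' coefficient matrices (an assumption for illustration: they share the same symmetric/antisymmetric row structure):

```python
# Even/odd butterfly sketch for an eight-point DCT-like transform:
# even-index basis rows are symmetric and odd-index rows antisymmetric,
# so Z splits into two four-point products on input sums and differences.
import math

N = 8
C = [[math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]
     for k in range(N)]                            # unscaled DCT-II basis

def direct(x):
    return [sum(C[k][n] * x[n] for n in range(N)) for k in range(N)]

def butterfly(x):
    s = [x[n] + x[N - 1 - n] for n in range(4)]    # even-part inputs (sums)
    d = [x[n] - x[N - 1 - n] for n in range(4)]    # odd-part inputs (differences)
    z = [0.0] * N
    for k in range(0, N, 2):                       # even rows use sums
        z[k] = sum(C[k][n] * s[n] for n in range(4))
    for k in range(1, N, 2):                       # odd rows use differences
        z[k] = sum(C[k][n] * d[n] for n in range(4))
    return z

x = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 1.0, 0.0]
za, zb = direct(x), butterfly(x)
assert all(abs(a - b) < 1e-9 for a, b in zip(za, zb))
```

The butterfly halves the multiply count per output, which is what lets CSDA_E and CSDA_O each work on four-point sub-transforms.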

3.7.2 TMEM
The TMEM is implemented using a 64-word 12-bit dual-port buffer and has a latency of 52 cycles. Based on the time scheduling strategy and its results, the 1st-D and 2nd-D transforms can be computed simultaneously. The transposition memory is an 8x8 buffer array with a data width of 16 bits, as shown in Fig. 3.8.

Fig.3.8 TMEM
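The role of the TMEM in the 2-D computation can be sketched as a row-column decomposition (toy 2x2 values, purely illustrative): the 1st-D pass transforms rows, the transposition memory turns rows into columns, and the 2nd-D pass yields C·X·C^T.

```python
# Row-column sketch of the 2-D transform through a transposition memory.
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

C = [[1, 1], [1, -1]]                 # toy 2-point butterfly matrix
X = [[3, 5], [2, 7]]
rows = matmul(C, X)                   # 1st-D transform (row pass)
tmem = transpose(rows)                # transposition memory
two_d = transpose(matmul(C, tmem))    # 2nd-D transform (column pass)
assert two_d == matmul(matmul(C, X), transpose(C))   # equals C*X*C^T
```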


CHAPTER 4
SYSTEM IMPLEMENTATION
4.1 Xilinx ISE Overview


The Integrated Software Environment (ISE) is the Xilinx

design software suite that allows you to take your design from design
entry through Xilinx device programming. The ISE Project Navigator
manages and processes your design through the following steps in the
ISE design flow.
4.1.1 Design Entry
Design entry is the first step in the ISE design flow. During design
entry, you create your source files based on your design objectives. You
can create your top-level design file using a Hardware Description
Language (HDL), such as VHDL, Verilog, or ABEL, or using a
schematic. You can use multiple formats for the lower-level source files
in your design.
4.1.2 Synthesis
After design entry and optional simulation, you run synthesis.
During this step, VHDL, Verilog, or mixed language designs become net
list files that are accepted as input to the implementation step.
4.1.3 Implementation
After synthesis, you run design implementation, which converts the
logical design into a physical file format that can be downloaded to the
selected target device. From Project Navigator, you can run the
implementation process in one step, or you can run each of the
implementation processes separately. Implementation processes vary

depending on whether you are targeting a Field Programmable Gate Array (FPGA) or a Complex Programmable Logic Device (CPLD).
4.1.4 Verification
You can verify the functionality of your design at several points in
the design flow. You can use simulator software to verify the
functionality and timing of your design or a portion of your design. The
simulator interprets VHDL or Verilog code into circuit functionality and
displays logical results of the described HDL to determine correct circuit
operation. Simulation allows you to create and verify complex functions
in a relatively small amount of time. You can also run in-circuit
verification after programming your device.
4.1.5 Device Configuration
After generating a programming file, you configure your device.
During configuration, you generate configuration files and download the
programming files from a host computer to a Xilinx device.
4.2 ModelSim Overview
ModelSim is a very powerful simulation environment, and as such

can be difficult to master. Thankfully with the advent of Xilinx Project


Navigator 6.2i, the Xilinx tools can take care of launching ModelSim to
simulate most projects. However, a rather large flaw in Xilinx Project
Navigator 6.2i is its inability to correctly handle test benches which
instantiate multiple modules. To correctly simulate a test bench which
instantiates multiple modules, you will need to create and use a
ModelSim project manually. The steps are fairly simple:
1. Create a directory for your project
2. Start ModelSim and create a new project
3. Add all your Verilog files to the project
4. Compile your Verilog files
5. Start the simulation
6. Add signals to the wave window
7. Recompile changed Verilog files
8. Restart/run the simulation
ModelSim is a simulation and debugging tool for VHDL, Verilog,
and mixed-language designs.
4.2.1 Basic simulation flow
The following diagram shows the basic steps for simulating a design in ModelSim.
4.2.2 Creating the working library
In ModelSim, all designs, be they VHDL, Verilog, or some
combination thereof, are compiled into a library. You typically start a
new simulation in ModelSim by creating a working library called "work".
"Work" is the library name used by the compiler as the default destination
for compiled design units.
4.2.3 Compiling your design
After creating the working library, you compile your design units
into it. The ModelSim library format is compatible across all supported
platforms. You can simulate your design on any platform without having
to recompile your design.

35

4.2.4 Running the simulation


With the design compiled, you invoke the simulator on a top-level
module (Verilog) or a configuration or entity/architecture pair (VHDL).
Assuming the design loads successfully, the simulation time is set to zero,
and you enter a run command to begin simulation.
4.2.5 Debugging your results
If you don't get the results you expect, you can use ModelSim's robust debugging environment to track down the cause of the problem.
4.3 Project flow
A project is a collection mechanism for an HDL design under specification or test. Even though you don't have to use projects in ModelSim, they may ease interaction with the tool and are useful for organizing files and specifying simulation settings. The following diagram shows the basic steps for simulating a design within a ModelSim project. As you can see, the flow is similar to the basic simulation flow. However, there are two important differences:
You do not have to create a working library in the project flow; it is done for you automatically.
Projects are persistent. In other words, they will open every time you invoke ModelSim unless you specifically close them.
4.4 Multiple library flow
ModelSim uses libraries in two ways:
1) As a local working library that contains the compiled version of your design;
2) As a resource library.

The contents of your working library will change as you update


your design and recompile. A resource library is typically static and
serves as a parts source for your design. You can create your own
resource libraries, or they may be supplied by another design team or a
third party (e.g., a silicon vendor).
You specify which resource libraries will be used when the design
is compiled, and there are rules to specify in which order they are
searched. A common example of using both a working library and a
resource library is one where your gate-level design and test bench are
compiled into the working library, and the design references gate-level
models in a separate resource library.
The diagram below shows the basic steps for simulating with
multiple libraries.
You can also link to resource libraries from within a project. If you
are using a project, you would replace the first step above with these two
steps: create the project and add the test bench to the project.
4.5 Debugging tools
ModelSim offers numerous tools for debugging and analyzing your design. Several of these tools are covered in subsequent lessons, including:
Setting breakpoints and stepping through the source code
Viewing waveforms and measuring time
Viewing and initializing memories
A project may also consist of:
HDL source files or references to source files
Other files such as READMEs or other project documentation
Local libraries
References to global libraries
4.6 VERILOG
Verilog, standardized as IEEE 1364, is a hardware description language (HDL) used to model electronic systems. It is most commonly used in the design and verification of digital circuits at the register-transfer level of abstraction. It is also used in the verification of analog circuits and mixed-signal circuits.
Verilog HDL is one of the two most common hardware description languages used by integrated circuit (IC) designers; the other is VHDL. HDLs allow the design to be simulated earlier in the design cycle in order to correct errors or experiment with different architectures. Designs described in HDL are technology-independent, easy to design and debug, and usually more readable than schematics, particularly for large circuits.
Verilog can be used to describe designs at four levels of
abstraction:
(i) Algorithmic level (much like c code with if, case and loop
statements).
(ii) Register transfer level (RTL uses registers connected by
Boolean equations).
(iii) Gate level (interconnected AND, NOR etc.).
(iv) Switch level (the switches are MOS transistors inside gates).
The language also defines constructs that can be used to control the input and output of simulation. More recently, Verilog is used as an input for synthesis programs which generate a gate-level description (a netlist) for the circuit. Some Verilog constructs are not synthesizable. Also, the way the code is written will greatly affect the size and speed of the synthesized circuit. Most readers will want to synthesize their circuits, so non-synthesizable constructs should be used only for test benches: these are program modules used to generate the I/O needed to simulate the rest of the design. The words "not synthesizable" will be used for examples and constructs, as needed, that do not synthesize.
There are two types of code in most HDLs:
Structural, which is a verbal wiring diagram without storage.
    assign a = b & c | d; /* | is OR */
    assign d = e & (~c);
Here the order of the statements does not matter: changing e will change a.
Procedural, which is used for circuits with storage, or as a convenient way to write conditional logic.
    always @(posedge clk) // execute on every rising clock edge
        count <= count + 1;
Procedural code is written like C code and assumes every assignment is stored in memory until overwritten. For synthesis with flip-flop storage, this type of thinking generates too much storage. However, people prefer procedural code because it is usually much easier to write; for example, if and case statements are only allowed in procedural code. As a result, synthesizers have been constructed which can recognize certain styles of procedural code as actually combinational.

CHAPTER 5
RESULT ANALYSIS

Table 5.1 Measured Results

[Table 5.1 compares the measured results of Chang et al., Lee et al., Huang et al., Lai et al., the existing CSDA design, and the proposed CSDA design. Recoverable entries include gate counts (NAND2) of 55.6K, 36.6K, 39.8K, 39.1K, and 36.8K for the prior works, about 30K for the existing system, and 27K for the proposed system; power consumption figures of 38.7 mW, 3.4 mW, 46.3 mW (existing), and 26 mW (proposed), with N/A marking unreported values; the supported dimensions (8x8, 8x4, 4x8, 4x4) for each design; and the supported standards (MPEG-1/2/4, H.264, VC-1), where a dash represents a non-supported standard and a tick represents a supported standard.]

5.1 COMPARISON WITH EXISTING SYSTEMS

While comparing the proposed system with the existing system, the usage of a buffer instead of a pipeline register considerably reduces the delay, which increases the speed, along with the reduction in the number of adders shown in Table 5.1. The gate count is reduced to 27K, whereas in the existing system the value is higher, at 30K. Since fewer adders are utilized, the power consumption is reduced to 26 mW, while the existing system consumes 46.3 mW. The proposed system also supports multiple standards.
From the above table, the measured results are compared with the existing results. In the proposed CSDA, the gate count is reduced compared with the existing CSDA, and the measured power consumption of 26 mW is also lower than that of the existing CSDA.
5.2 MUX SELECTION INPUTS

Table 5.2 Selection Inputs for Different Standards

[Table 5.2 lists, for each video codec standard (MPEG, H.264, VC-1) and transform dimension, the binary selection inputs applied to the seven multiplexers MUX-1 through MUX-7.]

These are the selection inputs which are given for the individual standards. The desired standard can be obtained using the MUX selection.
5.3 MPEG SIMULATION RESULT
By giving the selection inputs for the seven MUXes as the binary pattern (1111101) for the eight-point transform, we get the MPEG simulation output shown in Fig. 5.1.

Fig.5.1 Simulation result for MPEG

5.4 H.264 SIMULATION RESULT
By giving the selection inputs for the seven MUXes as the binary patterns (1001000) for the eight-point transform and (0000011) for the four-point transform, we get the H.264 simulation output shown in Fig. 5.2.

Fig.5.2 Simulation result for H.264

5.5 VC-1 SIMULATION RESULT
By giving the selection inputs for the seven MUXes as the binary patterns (1001000) for the eight-point transform and (0000011) for the four-point transform, we get the VC-1 simulation output shown in Fig. 5.3.

Fig.5.3 Simulation result for VC-1

5.6 RTL SCHEMATIC VIEW OF ENTIRE PROCESS
The synthesis was run using the Xilinx 13.2 software. The proposed MST core employs the distributed arithmetic and factor sharing schemes as common sharing distributed arithmetic (CSDA) to reduce hardware cost and delay.

Fig.5.4 RTL view of the whole 2D-CSDA architecture

Fig 5.5 RTL inner view of 2D CSDA

Fig 5.6 RTL inner view of 1D CSDA

5.7 SYNTHESIS REPORT FOR OUTPUT

Fig.5.7 Output for 2-D Common Sharing Distributed Arithmetic-MST delay

5.8 POWER ANALYZER OUTPUT

Fig.5.8 Output for 2-D Common Sharing Distributed Arithmetic-MST power

5.9 DEVICE UTILIZATION SUMMARY

Fig.5.9 Output for 2-D Common Sharing Distributed Arithmetic-MST gate count

CHAPTER 6
CONCLUSION
The CSDA-MST core achieves high performance, with a high throughput rate and a low-cost VLSI design, supporting the MPEG-1/2/4, H.264, and VC-1 MSTs. By using the proposed CSDA method, the number of adders and MUXes in the MST core is reduced efficiently. Synthesis and simulation results show that the CSDA-MST core requires only 27K logic gates with a power consumption of 26 mW, and achieves a throughput rate of 1.28 G-pixels/s, which can support the 4928x2048 @ 24 Hz digital cinema format. Because visual media technology has advanced rapidly, this approach will help meet rising high-resolution specifications and future needs as well.


REFERENCES

1. Chang, H., Kim, S., Lee, S., and Cho, K., Nov. 2009. "Design of area-efficient unified transform circuit for multi-standard video decoder," in Proc. IEEE Int. SoC Design Conf., pp. 369-372.

2. Chen, Y.H., Chang, T.Y., and Li, C.Y., Apr. 2011. "High throughput DA-based DCT with high accuracy error-compensated adder tree," IEEE Trans. Very Large Scale Integration (VLSI) Syst., vol. 19, no. 4, pp. 709-714.

3. Hoang, D.T. and Vitter, J.S., 2001. Efficient Algorithms for MPEG Video Compression. New York, USA: Wiley.

4. Huang, C.Y., Chen, L.F., and Lai, Y.K., May 2008. "A high-speed 2-D transform architecture with unique kernel for multi-standard video applications," in Proc. IEEE Int. Symp. Circuits Syst., pp. 21-24.

5. Hwangbo, W. and Kyung, C.M., Apr. 2010. "A multitransform architecture for H.264/AVC high-profile coders," IEEE Trans. Multimedia, vol. 12, no. 3, pp. 157-162.

6. Lai, Y.K. and Lai, Y.F., Aug. 2010. "A reconfigurable IDCT architecture for universal video decoders," IEEE Trans. Consum. Electron., vol. 56, no. 3, pp. 1872-1879.

7. Lee, S. and Cho, K., Feb. 2008. "Architecture of transform circuit for video decoder supporting multiple standards," Electron. Lett., vol. 44, no. 4, pp. 274-275.

8. Uramoto, S., Inoue, Y., Takabatake, A., Takeda, J., Yamashita, Y., Terane, T., and Yoshimoto, M., Apr. 1992. "A 100-MHz 2-D discrete cosine transform core processor," IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 492-499.
