

1. INTRODUCTION
IMAGE scaling has been widely applied in digital imaging devices such as digital cameras, digital video recorders, digital photo frames, high-definition televisions, mobile phones, tablet PCs, etc. An obvious application of image scaling is to scale down high-quality pictures or video frames to fit the small liquid crystal display panel of a mobile phone or tablet PC. As the graphic and video applications of mobile handheld devices grow, the demand for and significance of image scaling become more and more prominent. Image scaling algorithms can be separated into polynomial-based and non-polynomial-based methods. The simplest polynomial-based method is the nearest-neighbor algorithm. It has the benefit of low complexity, but the scaled images are full of blocking and aliasing artifacts. The most widely used scaling method is the bilinear interpolation algorithm [1], in which the target pixel is obtained by applying a linear interpolation model in both the horizontal and vertical directions. Another popular polynomial-based method is the bicubic interpolation algorithm [15], which uses an extended cubic model to acquire the target pixel from a 2-D regular grid. In recent years, many high-quality non-polynomial-based methods [2]–[4] have been proposed. These novel methods greatly improve image quality through efficient techniques such as curvature interpolation [2], bilateral filtering [3], and autoregressive modeling [4]. The methods mentioned above efficiently enhance image quality and reduce blocking, aliasing, and blurring artifacts. However, these high-quality image scaling algorithms have high complexity and high memory requirements, which makes them difficult to realize with VLSI techniques. Thus, for real-time applications, low-complexity image processing algorithms are necessary for VLSI implementation [5]–[9].

To meet the demands of real-time image scaling applications, some previous studies [10]–[15] have proposed low-complexity methods for VLSI implementation. Kim et al. proposed the area-pixel model Winscale [10], and Lin et al. realized an efficient VLSI design for it [11]. Chen et al. [12] also proposed an area-pixel-based scaler design enhanced by an edge-oriented technique. Lin et al. [13], [14] presented a low-cost VLSI scaler design based on the bicubic scaling algorithm.

In our previous work [15], an adaptive, real-time, low-cost, and high-quality image scaler was proposed. It successfully improves the image quality by adding sharpening spatial and clamp filters as prefilters [5], together with an adaptive technique based on the bilinear interpolation algorithm. Although the hardware cost and memory requirement were efficiently reduced, the design still requires four line buffers. Hence, a low-cost image scaler design with an even lower memory requirement is proposed in this work.




























2. EXISTING SYSTEM
2.1. VLSI Implementation of an Edge-Oriented Image Scaling Processor

IMAGE scaling is widely used in many fields [1]–[4], ranging from consumer electronics to medical imaging. It is indispensable when the resolution of an image generated by a source device differs from the screen resolution of the target display. For example, we have to enlarge images to fit HDTV or to scale them down to fit the mini-size portable LCD panel. The simplest and most widely used scaling methods are the nearest-neighbor [5] and bilinear [6] techniques. In recent years, many efficient scaling methods have been proposed in the literature [7]–[14]. According to the required computations and memory space, the existing scaling methods [5]–[14] can be divided into two classes: lower-complexity [5]–[8] and higher-complexity [9]–[14] scaling techniques. The complexity of the former is very low and comparable to that of the conventional bilinear method; the latter yields visually pleasing images by utilizing more advanced scaling methods. In many practical real-time applications, the scaling process is included in end-user equipment, so a good lower-complexity scaling technique, which is simple and suitable for low-cost VLSI implementation, is needed. In this paper, we consider the lower-complexity scaling techniques [5]–[8] only. Kim et al. presented a simple area-pixel scaling method in [7].
It uses an area-pixel model instead of the common point-pixel model and takes a maximum of four pixels of the original image to calculate one pixel of the scaled image. By using the area coverage of the source pixels from the applied mask in combination with the luminosity difference among the source pixels, Andreadis et al. [8] proposed a modified area-pixel scaling algorithm and its circuit to obtain better edge preservation. Both [7] and [8] obtain better edge preservation but require about twice as many computations as the bilinear method.
To achieve the goal of lower cost, we present an edge-oriented area-pixel scaling processor in this paper. The area-pixel scaling technique is approximated and implemented with a proper, low-cost VLSI circuit in our design. The proposed scaling processor can support floating-point magnification factors and preserves edge features efficiently by taking into account the local characteristics of the available source pixels around the target pixel. Furthermore, it handles streaming data directly and requires only a small amount of memory: one line buffer rather than a full frame buffer. The experimental results demonstrate that the proposed design performs better than other lower-complexity image scaling methods [5]–[8] in terms of both quantitative evaluation and visual quality.

The seven-stage VLSI architecture for the proposed design was implemented and synthesized by using Verilog HDL and the Synopsys Design Compiler, respectively. In our simulation, the circuit achieves 200 MHz with a 10.4-K gate count in TSMC 0.18-μm technology. Since it can process one pixel per clock cycle, it is fast enough to process video at WQSXGA (3200 x 2048) resolution at 30 frames/s in real time.
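A quick check of this claim: one pixel per cycle at 200 MHz gives 200 Mpixels/s, while WQSXGA at 30 frames/s requires 3200 x 2048 x 30 = 196,608,000, or roughly 197 Mpixels/s, so the stated clock rate is just sufficient for real-time operation.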

DRAWBACKS:
The architecture works only with monochromatic images.
Extending the architecture to RGB images is very costly.
The processing rate is limited to 200 megapixels/second.




















2.2. A Low-Cost VLSI Implementation for Efficient Removal of Impulse Noise

IN SUCH applications as printing, medical imaging, scanning, image segmentation, and face recognition, images are often corrupted by noise during image acquisition and transmission. Hence, an efficient denoising technique is very important for image processing applications [1]. Recently, many image denoising methods have been proposed for impulse noise suppression [2]–[17]. Some of them employ the standard median filter [2] or its modifications [3], [4] to implement the denoising process. However, these approaches might blur the image since both noisy and noise-free pixels are modified. To avoid damaging noise-free pixels, efficient switching strategies have been proposed in the literature [5]–[17]. In general, the switching median filter [5] consists of two steps: 1) impulse detection and 2) noise filtering. It locates the noisy pixels with an impulse detector and then filters only those pixels, rather than all pixels of the image, to avoid damaging noise-free pixels. Generally, the denoising methods for impulse noise suppression can be classified into two categories: lower-complexity techniques [6]–[13] and higher-complexity techniques [14]–[17]. The former use a fixed-size local window and require only a few line buffers; their computational complexity is low and comparable to that of the conventional median filter or its modifications [2]–[4]. The latter yield visually pleasing images by enlarging the local window size adaptively [15], [16] or by iterating [14]–[17]. In this paper, we focus only on the lower-complexity denoising techniques because of their simplicity and ease of implementation with a VLSI circuit.
In [6], Zhang and Karim proposed a new impulse detector (NID) for the switching median filter. NID uses the minimum absolute value of four convolutions obtained with 1-D Laplacian operators to detect noisy pixels. A method named the differential rank impulse detector (DRID) is presented in [7]. The impulse detector of DRID is based on a comparison of signal samples within a narrow rank window by both rank and absolute value. In [8], Luo proposed a method that can efficiently remove impulse noise (ERIN) based on a simple fuzzy impulse detection technique. An alpha-trimmed mean-based method (ATMBM) was presented in [9]. It uses the alpha-trimmed mean in impulse detection and replaces the noisy pixel value by a linear combination of its original value and the median of its local window. In [10], a decision-based algorithm (DBA) is proposed to replace the corrupted pixel by the median or by a neighboring pixel value according to the proposed decisions. For real-time embedded applications, the VLSI implementation of a switching median filter for impulse noise removal is necessary and should be considered. For customers, cost is usually the most important issue when choosing consumer electronic products, so we focus on a low-cost denoising implementation in this paper. The cost of a VLSI implementation depends mainly on the required memory and the computational complexity. Hence, less memory and fewer operations are necessary for a low-cost denoising implementation. Based on these two factors, we propose a simple edge-preserved denoising technique (SEPD) and its VLSI implementation for removing fixed-value impulse noise. The storage space needed for SEPD is two line buffers rather than a full frame buffer, and only simple arithmetic operations, such as addition and subtraction, are used in SEPD. We propose a useful impulse noise detector to detect noisy pixels and employ an effective design to locate their edges. The experimental results demonstrate that SEPD obtains better performance in terms of both quantitative evaluation and visual quality than other state-of-the-art lower-complexity impulse denoising methods [6]–[13]. Furthermore, the VLSI implementation of our method also outperforms previous hardware circuits [11]–[13], [19], [20] in terms of quantitative evaluation, visual quality, and hardware cost.


DRAWBACKS:
The architecture works only with monochromatic images.
It does not work with RGB images.
Poor performance in terms of quantitative evaluation and visual quality.












2.3. VLSI ARCHITECTURE OF AN AREA EFFICIENT IMAGE INTERPOLATION

Image interpolation is a widely used technique in image processing, medical imaging, and computer graphics [1]. It is a method of constructing new data points within the range of a discrete set of known data points [2], [3]. Interpolation processes are transformations between two regularly sampled grids, one at the input resolution and another at the output resolution [4]. A variety of applications require image zooming, such as digital cameras, electronic publishing, third-generation mobile phones, medical imaging, and general image processing [5]. Image resolution limits the extent to which zooming improves clarity, restricts the quality of digital photograph magnifications, and, in the case of medical images, can prevent a correct diagnosis. Single-image interpolation (zooming, up-sampling, or resizing) can synthetically increase image resolution for display or printing, but it is usually limited in its ability to enhance image precision or reveal higher-frequency content. Image interpolation methods based on approximations of the ideal sinc kernel (pixel replication, bilinear, bicubic, and higher-order splines) are commonly used for their flexibility and speed, though they often lead to blurring, ringing artifacts, jagged edges, and unnatural rendering of object intensity [6]. These methods can, however, be improved by adapting the sinc-approximating kernel to the image being interpolated.
In recent years, many types of image interpolation technique have been proposed. The cubic convolution interpolation function, which offers a good compromise between computational complexity and reconstruction accuracy, has been exploited [7]. Among the various proposed interpolation algorithms, the simplest is the nearest-neighbour algorithm [8]. It has very low time complexity and a fairly easy implementation, since it simply selects the value of the nearest point as the output. However, images interpolated by the nearest-neighbour method are full of blocking and aliasing artifacts. Another simple interpolation method is the bilinear algorithm, which employs a linear interpolation model to compute unknown pixels, but it causes serious blurring [5]. Bicubic interpolation is one of the conventional image interpolation techniques. This method is attractive for its algorithmic simplicity, which is highly desirable for fast implementation, but it may introduce blurring and other annoying image artifacts, especially around edges [9]. An extended linear convolution interpolation has been proposed [10]; it can give higher image quality than bicubic interpolation. Another interpolation method, the Error-Amended Sharp Edge (EASE) scheme, builds on bilinear interpolation and reduces the interpolation error [11].
In many practical real-time applications, the interpolation process is included in the end-user equipment. It has become a considerable trend to design low-cost, high-quality, and high-speed interpolation with VLSI techniques, from home appliances to medical image processing [12]. VLSI hardware architectures for such applications can be implemented using an FPGA, an integrated circuit designed to be configured by a customer or a designer after manufacturing.
This work presents a VLSI architecture of a low-complexity image interpolation algorithm based on a convolution kernel [13] and EASE interpolation [11]. The remaining portion of this paper is structured as follows. Chapter II gives a brief analysis of different image interpolation techniques. Chapter III describes in detail the FPGA implementation of the proposed interpolation architecture. Chapter IV compares different VLSI optimization parameters, such as the number of look-up tables (LUTs), power, and combinational path delay, of both the existing method [13] and the proposed method.

DRAWBACKS:
The architecture occupies a large area and therefore consumes high power.
It operates with low accuracy.













3. PROPOSED SYSTEM
Fig. 1 shows the block diagram of the proposed scaling algorithm. It consists of a sharpening spatial filter, a clamp filter, and a bilinear interpolator. The sharpening spatial and clamp filters [6] serve as prefilters [5] to reduce the blurring and aliasing artifacts produced by the bilinear interpolation. First, the input pixels of the original image are filtered by the sharpening spatial filter to enhance the edges and remove associated noise. Second, the filtered pixels are filtered again by the clamp filter to smooth unwanted discontinuous edges of the boundary regions. Finally, the pixels filtered by both the sharpening spatial and clamp filters are passed to the bilinear interpolation for up-/downscaling. To conserve computing resources and memory, these two filters are simplified and merged into a single combined filter. The details of each part are described in the following sections.


Fig. 1. Block diagram of the proposed scaling algorithm for image zooming.





Fig. 2. Weights of the convolution kernels. (a) 3 x 3 convolution kernel. (b) Cross-
model convolution kernel. (c) T-model and inversed T-model convolution kernels.

A. Low-Complexity Sharpening Spatial and Clamp Filters

The sharpening spatial filter, a kind of high-pass filter, is used to reduce blurring artifacts and is defined by a kernel that increases the intensity of the center pixel relative to its neighboring pixels. The clamp filter [6], a kind of low-pass filter, is a 2-D Gaussian spatial-domain filter composed of a convolution kernel array; it usually contains a single positive value at its center and is completely surrounded by ones [15]. The clamp filter is used to reduce aliasing artifacts and to smooth the unwanted discontinuous edges of the boundary regions. Both the sharpening spatial and clamp filters can be represented by convolution kernels. A larger convolution kernel produces higher image quality, but it also demands more memory and hardware cost. For example, a 6 x 6 convolution filter demands at least a five-line-buffer memory and 36 arithmetic units, which is much more than the two-line-buffer memory and nine arithmetic units of a 3 x 3 convolution filter. In our previous work [15], each of the sharpening spatial and clamp filters was realized by a 2-D 3 x 3 convolution kernel, as shown in Fig. 2(a), which demands at least a four-line-buffer memory for the two 3 x 3 convolution filters. For example, if the image width is 1920 pixels, 4 x 1920 x 8 bits of data must be buffered in memory as input for processing. To reduce the complexity of the 3 x 3 convolution kernel, a cross-model kernel is used to replace it, as shown in Fig. 2(b). This successfully eliminates four of the nine parameters of the 3 x 3 convolution kernel. Furthermore, to further decrease the complexity and memory requirement of the cross-model convolution kernel, T-model and inversed T-model convolution kernels are proposed for realizing the sharpening spatial and clamp filters. As shown in Fig. 2(c), the T-model convolution kernel is composed of the lower four parameters of the cross-model, and the inversed T-model convolution kernel is composed of the upper four parameters. In the proposed scaling algorithm, both the T-model and inversed T-model filters are used simultaneously to improve image quality. The T-model or inversed T-model filter is simplified from the 3 x 3 convolution filter of the previous work [15], which not only efficiently reduces the complexity of the convolution filter but also decreases the memory requirement from two line buffers to one for each convolution filter. The T-model and inversed T-model thus provide low-complexity, low-memory-requirement convolution kernels for the sharpening spatial and clamp filters, allowing the VLSI chip of the proposed low-cost image scaling processor to be integrated.
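To make the line-buffer requirement concrete, the following is a minimal Verilog sketch of a single line buffer that delays the pixel stream by one image row so that a T-model window can see two rows at once. The module name, parameter values, and interface are illustrative assumptions, not the exact circuit of this design.

module line_buffer #(parameter WIDTH = 1920, BITS = 8)(
  input                 clk,
  input                 rst,        // synchronous reset of the column pointer
  input                 en,         // advance one pixel per clock cycle
  input      [BITS-1:0] pixel_in,   // incoming pixel of the current row (n+1)
  output reg [BITS-1:0] pixel_out   // pixel of the same column, one row earlier (n)
);
  reg [BITS-1:0] mem [0:WIDTH-1];   // one full image row of storage
  reg [10:0]     ptr;               // column pointer, wide enough for WIDTH-1

  always @(posedge clk)
    if (rst)
      ptr <= 0;
    else if (en) begin
      pixel_out <= mem[ptr];                        // read the pixel stored one row ago
      mem[ptr]  <= pixel_in;                        // overwrite it with the new pixel
      ptr       <= (ptr == WIDTH-1) ? 0 : ptr + 1;  // wrap at the end of the row
    end
endmodule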





B. Combined Filter
In the proposed scaling algorithm, the input image is filtered by a sharpening spatial filter and then filtered again by a clamp spatial filter. Although the sharpening spatial and clamp filters are simplified by T-models and inversed T-models, two line buffers are still needed to store input data or intermediate values for each T-model or inversed T-model filter. Thus, to further reduce the computing resources and memory requirement, the sharpening spatial and clamp filters, which are formed by the T-model or inversed T-model, are combined into a single combined filter as



where S and C are the sharp and clamp parameters and P'(m,n) is the filtered result of the target pixel P(m,n) produced by the combined filter. A T-model sharpening spatial filter and a T-model clamp filter are thus replaced by a single combined T-model filter, as shown in (1). To remove one more line buffer, the only parameter in the third row, the weight 1 of P(m,n-2), is removed, and its weight is added to the parameter S-C of P(m,n-1), giving S-C-1, as shown in (2). The combined inversed T-model filter can be produced in the same way. In the new architecture of the combined filter, the two T-model or inversed T-model filters are merged into one combined T-model or inversed T-model filter. By this filter-combination technique, the memory demand can be efficiently decreased from two line buffers to one, which greatly reduces the memory access requirements for software systems and the memory cost for VLSI implementation.
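The general identity that makes such a merger possible is that filtering with two kernels in sequence is equivalent to filtering once with the convolution of the two kernels. Writing Hs for the sharpening kernel and Hc for the clamp kernel (symbols introduced here only for illustration),

(P * Hs) * Hc = P * (Hs * Hc)

so the two prefilter passes can be replaced by a single pass with a precomputed combined kernel. The combined kernel in (1) and (2) is a low-cost approximation of this idea: it keeps a T-model-sized support and, as described above, moves the weight of the P(m,n-2) row into the P(m,n-1) row so that only one line buffer is required.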




















C. Simplified Bilinear Interpolation
In the proposed scaling algorithm, the bilinear interpolation method is selected because of its low complexity and high quality. Bilinear interpolation is an operation that performs a linear interpolation first in one direction and then again in the other direction. The output pixel is calculated by linear interpolation in both the x- and y-directions using the four nearest neighbor pixels. The target pixel P(k,l) can be calculated by

P(k,l) = (1-dx)(1-dy) P(m,n) + dx(1-dy) P(m+1,n) + (1-dx) dy P(m,n+1) + dx dy P(m+1,n+1)    (3)

where P(m,n), P(m+1,n), P(m,n+1), and P(m+1,n+1) are the four nearest neighbor pixels of the original image, and dx and dy are the scale parameters (fractional distances) in the horizontal and vertical directions.


Fig. 3. Block diagram of the VLSI architecture for the proposed real-time image scaling processor.
By (3), we can easily find that the computing resources of the bilinear interpolation are eight multiply, four subtract, and three add operations. It costs a considerable chip area to implement a bilinear interpolator with eight multipliers and seven adders. Thus, an algebraic manipulation is used to reduce the computing resources of the bilinear interpolation. The original equation of bilinear interpolation is presented in (3), and the simplifying procedure is described in (4)–(6). Since the function P(m,n) + dy x (P(m,n+1) - P(m,n)) appears twice in (6), one of the two calculations of this algebraic function can be eliminated.

Because of the scanning direction of the bilinear interpolation [15], the value of dy is the same for all target pixels interpolated between row n and row n+1, and only the value of dx changes with the horizontal position. The result of the function P(m,n) + dy x (P(m,n+1) - P(m,n)) can therefore be replaced by the previously computed result of P(m+1,n) + dy x (P(m+1,n+1) - P(m+1,n)), as shown in (6). This simplification reduces the computing resources from eight multiply, four subtract, and three add operations to two multiply, two subtract, and two add operations.
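For clarity, the algebraic manipulation described above can be sketched as follows; the intermediate terms L and R are named here only for readability and are not necessarily the notation used in (4)–(6).

P(k,l) = (1-dx)(1-dy) P(m,n) + dx(1-dy) P(m+1,n) + (1-dx) dy P(m,n+1) + dx dy P(m+1,n+1)
       = (1-dx) L + dx R
       = L + dx (R - L)

where L = P(m,n) + dy (P(m,n+1) - P(m,n)) and R = P(m+1,n) + dy (P(m+1,n+1) - P(m+1,n)). When the interpolation window slides one source column to the right, the new L is exactly the R that was computed for the previous position, so each output pixel needs only one new vertical term (one subtract, one multiply, one add to form R) and one horizontal step (one subtract, one multiply, one add to form L + dx (R - L)), giving the two-multiply, two-subtract, two-add count stated above.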





3.1. VLSI ARCHITECTURE
The proposed scaling algorithm consists of two combined prefilters and one
simplified bilinear interpolator. For VLSI implementation, the bilinear interpolator
can directly obtain two input pixels from two combined prefilters without any
additional line-buffer memory. Fig. 3 shows the block diagram of the VLSI
architecture for the proposed design. It consists of four main blocks: a register
bank, a combined filter, a bilinear interpolator, and a controller. The details of each
part will be described in the following sections.
A. Register Bank
In this design, the combined filter produces the target pixels using ten source pixels. The register bank is designed with a one-line memory buffer and provides these ten values for the immediate use of the combined filter.


Fig. 4. Architecture of the register bank.


Fig. 5. Computational scheduling of the proposed combined filter and simplified bilinear interpolator.

Fig. 4 shows the architecture of the register bank, which has a structure of ten shift registers. When the shifting control signal is produced by the controller, a new input pixel value of row n+1 is read into Reg41, and each value stored in the other registers belonging to row n+1 is shifted right into the next register or into the line-buffer memory. Reg40 reads a new value of P(m+2,n) from the line-buffer memory, and each value in the other registers belonging to row n is shifted right into the next register.












B. Combined Filter
The combined T-model or inversed T-model convolution function of the sharpening spatial and clamp filters was discussed in the previous section, and the equation is represented in (1). Fig. 5 shows the six-stage pipelined architecture of the combined filter and bilinear interpolator, which shortens the delay path to improve performance through pipelining. Stages 1 and 2 in Fig. 5 show the computational scheduling of a combined T-model filter and an inversed T-model filter. The T-model or inversed T-model filter consists of three reconfigurable calculation units (RCUs), one multiplier-adder (MA), three adders (+), three subtractors (-), and three shifters (S). The hardware architecture of the combined T-model filter can be directly mapped from the convolution equation shown in (1). The values of the ten source pixels are obtained from the register bank mentioned earlier. The symmetrical circuit, as shown in stages 1 and 2 of Fig. 5, is the combined inversed T-model filter designed to produce the filtered result P'(m,n+1). The T-model and the inversed T-model are thus used to obtain the values of P'(m,n) and P'(m,n+1) simultaneously. The architecture of this symmetrical circuit has a structure similar to that of the combined T-model filter, as shown in stages 1 and 2 of Fig. 5. Both the combined filter and the symmetrical circuit consist of one MA and three RCUs. The MA can be implemented by a multiplier and an adder. The RCU is designed to produce (S-C) and (S-C-1) times the source pixel value, which depend on the C and S parameters. The C and S parameters can be set by users according to the characteristics of the images. The architecture of the proposed low-cost combined filter can filter the whole image with only a one-line-buffer memory, which successfully decreases the memory requirement of the combined filters from four line buffers in our previous work [15] to one.


TABLE I
PARAMETERS AND COMPUTING RESOURCE FOR RCU


Fig. 6. Architecture of the RCU.
Table I lists the parameters and computing resources of the RCU. With the selected C and S values listed in Table I, the gain of the clamp or sharp convolution function is {8, 16, 32} or {4, 8, 16}, which can be removed by a shifter rather than a divider. Fig. 6 shows the architecture of the RCU. It consists of four shifters, three multiplexers (MUX), three adders, and one sign circuit. With this RCU design, the hardware cost of the combined filters can be efficiently reduced.
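The shift-based arithmetic that lets the RCU avoid multipliers and dividers can be sketched as follows; the weight of 5 and the gain of 16 are illustrative values only, not entries of Table I.

module shift_scale #(parameter BITS = 8)(
  input  [BITS-1:0] pixel,
  output [BITS-1:0] scaled
);
  // constant weight 5 built from a shift and an add: 5*pixel = (pixel << 2) + pixel
  wire [BITS+2:0] weighted = ({3'b000, pixel} << 2) + {3'b000, pixel};
  // a power-of-two gain (here 16) is removed with a right shift instead of a divider
  assign scaled = weighted >> 4;
endmodule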







C. Bilinear Interpolator and Controller

As discussed earlier, the bilinear interpolation is simplified as shown in (6). Stages 3, 4, 5, and 6 in Fig. 5 show the four-stage pipelined architecture, in which two-stage pipelined multipliers are used to shorten the delay path of the bilinear interpolator. The input values P'(m,n) and P'(m,n+1) are obtained from the combined filter and the symmetrical circuit. By the hardware-sharing technique shown in (6), the temporary result of the vertical interpolation function can be replaced by the previously computed result, which means that one multiplier and two adders can be removed by adding only one register. The controller is implemented as a finite-state-machine circuit. It produces the control signals that manage the timing and pipeline stages of the register bank, the combined filter, and the bilinear interpolator.
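A minimal sketch of such a finite-state-machine controller is shown below; the states, signal names, and conditions are illustrative assumptions and not the actual controller of this design.

module controller(
  input      clk,
  input      rst_n,
  input      pixel_valid,   // a new source pixel is available this cycle
  output reg shift_en,      // advances the register bank and line buffer
  output reg pipe_en        // clock-enable for the pipeline stages downstream
);
  localparam IDLE = 1'b0, RUN = 1'b1;
  reg state;

  always @(posedge clk or negedge rst_n)
    if (!rst_n) begin
      state    <= IDLE;
      shift_en <= 1'b0;
      pipe_en  <= 1'b0;
    end else
      case (state)
        IDLE: begin
          shift_en <= 1'b0;
          pipe_en  <= 1'b0;
          if (pixel_valid) state <= RUN;
        end
        RUN: begin
          shift_en <= pixel_valid;  // shift once per accepted input pixel
          pipe_en  <= pixel_valid;  // advance the pipeline once per accepted pixel
          if (!pixel_valid) state <= IDLE;
        end
      endcase
endmodule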























3.3. SIMULATION RESULTS AND CHIP IMPLEMENTATION
To analyze the quality of the images scaled by the various scaling algorithms, the peak signal-to-noise ratio (PSNR) is used to quantify the difference between the scaled and the original images. Since the maximum value of each pixel is 255, the PSNR expressed in dB can be calculated as

PSNR = 10 x log10( 255^2 / MSE ),   MSE = (1 / (M x N)) Σ_i Σ_j [ O(i,j) - S(i,j) ]^2

in which O and S denote the original and the scaled images


















TABLE II
COMPARISONS OF AVERAGE PSNR FOR VARIOUS SCALING
ALGORITHMS


A: using the sharp filter and bilinear interpolation; B: using the clamp filter and bilinear interpolation; C: using the combined filter and bilinear interpolation.

TABLE III
COMPARISON OF COMPUTING RESOURCE AND MEMORY
REQUIREMENT




Fig. 7. Chip photomicrograph.

where M and N are the width and height of the original image. Furthermore, eight widely used test images [15] with the size 512 x 512 were selected for testing. In the quality evaluation procedure, each test image is filtered by a fixed low-pass filter (averaging filter) and then scaled up/down to different sizes such as 256 x 256 (half size), 352 x 288 common intermediate format (CIF), 640 x 480 video graphics array (VGA), 720 x 480 (D1), 1024 x 1024 (double size), and 1920 x 1080 high-definition multimedia interface (HDMI), as listed in Table II. To show how the image quality changes after using the clamp filter, the sharp filter, and the proposed combined filter, three kinds of PSNR results are listed in this work as A (sharp filter), B (clamp filter), and C (combined filter) in Table II. The experimental results show that this work achieves better quantitative quality than the previous low-complexity scaling algorithms [1], [10], [12]. The average PSNR of the bilinear interpolation [1] and of this work is 28.15 and 28.54 dB, respectively, which means that the combined T-model and inversed T-model filters improve the image quality by 0.39 dB. The quantitative qualities of the bicubic (BC) algorithm [13] and of our previous work [15] are better than this work because [13] and [15] obtain the target pixel by more complex calculations and refer to more neighboring pixels than this work. As listed in Table III, [13] requires 32 multiplication operations, which is eight times the quantity of this work, and the memory requirement of [13] or [15] is six or four line buffers, which is six or four times the one-line-buffer memory of this work.






















4. Software and Hardware Requirements
4.1. VERY-LARGE-SCALE INTEGRATION
Very-large-scale integration (VLSI) is the process of creating integrated circuits by
combining thousands of transistors into a single chip. VLSI began in the 1970s when complex
semiconductor and communication technologies were being developed. The microprocessor is a
VLSI device. The term is no longer as common as it once was, as chips have increased in
complexity to billions of transistors.
The first semiconductor chips held two transistors each. Subsequent advances added
more and more transistors, and, as a consequence, more individual functions or systems were
integrated over time. The first integrated circuits held only a few devices, perhaps as many as ten
diodes, transistors, resistors and capacitors, making it possible to fabricate one or more logic
gates on a single device. Now known retrospectively as small-scale integration (SSI),
improvements in technique led to devices with hundreds of logic gates, known as medium-scale
integration (MSI). Further improvements led to large-scale integration (LSI), i.e. systems with at
least a thousand logic gates. Current technology has moved far past this mark and today's
microprocessors have many millions of gates and billions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale
integration above VLSI. Terms like ultra-large-scale integration (ULSI) were used. But the huge
number of gates and transistors available on common devices has rendered such fine distinctions
moot. Terms suggesting greater than VLSI levels of integration are no longer in widespread use.
Even VLSI is now somewhat quaint, given the common assumption that all microprocessors are
VLSI or better.
As of early 2008, billion-transistor processors are commercially available. This is
expected to become more commonplace as semiconductor fabrication moves from the current
generation of 65 nm processes to the next 45 nm generations (while experiencing new challenges
such as increased variation across process corners). A notable example is Nvidia's 280 series
GPU. This GPU is unique in the fact that almost all of its 1.4 billion transistors are used for
logic, in contrast to the Itanium, whose large transistor count is largely due to its 24 MB L3
cache. Current designs, as opposed to the earliest devices, use extensive design automation and
automated logic synthesis to lay out the transistors, enabling higher levels of complexity in the
resulting logic functionality. Certain high-performance logic blocks like the SRAM (Static
Random Access Memory) cell, however, are still designed by hand to ensure the highest
efficiency (sometimes by bending or breaking established design rules to obtain the last bit of
performance by trading stability).
Structured design
Structured VLSI design is a modular methodology originated by Carver Mead and Lynn
Conway for saving microchip area by minimizing the interconnect fabrics area. This is obtained
by repetitive arrangement of rectangular macro blocks which can be interconnected using wiring
by abutment. An example is partitioning the layout of an adder into a row of equal bit slices
cells. In complex designs this structuring may be achieved by hierarchical nesting.
Structured VLSI design had been popular in the early 1980s, but lost its popularity later because
of the advent of placement and routing tools wasting a lot of area by routing, which is tolerated
because of the progress of Moore's Law. When introducing the hardware description language
KARL in the mid-1970s, Reiner Hartenstein coined the term "structured VLSI design"
(originally as "structured LSI design"), echoing Edsger Dijkstra's structured programming
approach by procedure nesting to avoid chaotic spaghetti-structured programs.
VERILOG
In the semiconductor and electronic design industry, Verilog is a hardware description
language (HDL) used to model electronic systems. Verilog HDL, not to be confused with
VHDL, is most commonly used in the design, verification, and implementation of digital logic
chips at the register transfer level (RTL) of abstraction. It is also used in the verification of
analog and mixed-signal circuits.
Hardware description languages, such as Verilog, differ from software programming
languages because they include ways of describing the propagation of time and signal
dependencies (sensitivity). There are two assignment operators, a blocking assignment (=), and a
non-blocking (<=) assignment. The non-blocking assignment allows designers to describe a
state-machine update without needing to declare and use temporary storage variables (in any
general programming language we need to define some temporary storage spaces for the
operands to be operated on subsequently; those are temporary storage variables). Since these
concepts are part of Verilog's language semantics, designers could quickly write descriptions of
large circuits, in a relatively compact and concise form. At the time of Verilog's introduction
(1984), Verilog represented a tremendous productivity improvement for circuit designers who
were already using graphical schematic capture software and specially-written software
programs to document and simulate electronic circuits.
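As a small illustration of the non-blocking assignment described above (a generic sketch, not code from this design), the following module swaps two registers without declaring a temporary variable:

module swap_regs(
  input            clk,
  input            load,
  input      [7:0] a_in, b_in,
  output reg [7:0] a, b
);
  always @(posedge clk) begin
    if (load) begin
      a <= a_in;
      b <= b_in;
    end else begin
      // non-blocking assignments: both right-hand sides are sampled before
      // either register updates, so a and b swap without temporary storage
      a <= b;
      b <= a;
    end
  end
endmodule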
The designers of Verilog wanted a language with syntax similar to the C programming
language, which was already widely used in engineering software development. Verilog is case-
sensitive, has a basic preprocessor (though less sophisticated than that of ANSI C/C++), and
equivalent control flow keywords (if/else, for, while, case, etc.), and compatible operator
precedence. Syntactic differences include variable declaration (Verilog requires bit-widths on
net/reg types), demarcation of procedural blocks (begin/end instead of curly braces {}), and
many other minor differences.
A Verilog design consists of a hierarchy of modules. Modules encapsulate design
hierarchy, and communicate with other modules through a set of declared input, output, and
bidirectional ports. Internally, a module can contain any combination of the following:
net/variable declarations (wire, reg, integer, etc.), concurrent and sequential statement blocks,
and instances of other modules (sub-hierarchies). Sequential statements are placed inside a
begin/end block and executed in sequential order within the block. But the blocks themselves are
executed concurrently, qualifying Verilog as a dataflow language.
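A minimal sketch of this module hierarchy and port-connection style (the module names are illustrative):

module half_adder(
  input  a, b,
  output sum, carry
);
  assign sum   = a ^ b;   // concurrent (dataflow) assignments
  assign carry = a & b;
endmodule

module top(
  input  x, y,
  output s, c
);
  // instance of a sub-module, connected through named ports
  half_adder u0 (.a(x), .b(y), .sum(s), .carry(c));
endmodule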
Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0, floating,
undefined"), and strengths (strong, weak, etc.) This system allows abstract modeling of shared
signal-lines, where multiple sources drive a common net. When a wire has multiple drivers, the
wire's (readable) value is resolved by a function of the source drivers and their strengths.

A subset of statements in the Verilog language are synthesizable. Verilog modules that
conform to a synthesizable coding-style, known as RTL (register transfer level), can be
physically realized by synthesis software. Synthesis-software algorithmically transforms the
(abstract) Verilog source into a netlist, a logically-equivalent description consisting only of
elementary logic primitives (AND, OR, NOT, flip-flops, etc.) that are available in a specific
FPGA or VLSI technology. Further manipulations to the netlist ultimately lead to a circuit
fabrication blueprint (such as a photo mask set for an ASIC, or a bitstream file for an FPGA).
Beginning
Verilog was invented by Phil Moorby and Prabhu Goel during the winter of 1983/1984 at
Automated Integrated Design Systems (renamed to Gateway Design Automation in 1985) as a
hardware modeling language. Gateway Design Automation was purchased by Cadence Design
Systems in 1990. Cadence now has full proprietary rights to Gateway's Verilog and the Verilog-
XL logic simulator.

Verilog-95
With the increasing success of VHDL at the time, Cadence decided to make the language
available for open standardization. Cadence transferred Verilog into the public domain under the
Open Verilog International (OVI) (now known as Accellera) organization. Verilog was later
submitted to IEEE and became IEEE Standard 1364-1995, commonly referred to as Verilog-95.
In the same time frame Cadence initiated the creation of Verilog-A to put standards
support behind its analog simulator Spectre. Verilog-A was never intended to be a standalone
language and is a subset of Verilog-AMS which encompassed Verilog-95.
Verilog 2001
Extensions to Verilog-95 were submitted back to IEEE to cover the deficiencies that
users had found in the original Verilog standard. These extensions became IEEE Standard 1364-
2001 known as Verilog-2001.

Verilog-2001 is a significant upgrade from Verilog-95. First, it adds explicit support for
(2's complement) signed nets and variables. Previously, code authors had to perform signed-
operations using awkward bit-level manipulations (for example, the carry-out bit of a simple 8-
bit addition required an explicit description of the boolean-algebra to determine its correct
value). The same function under Verilog-2001 can be more succinctly described by one of the
built-in operators: +, -, /, *, >>>. A generate/endgenerate construct (similar to VHDL's
generate/endgenerate) allows Verilog-2001 to control instance and statement instantiation
through normal decision-operators (case/if/else). Using generate/endgenerate, Verilog-2001 can
instantiate an array of instances, with control over the connectivity of the individual instances.
File I/O has been improved by several new system-tasks. And finally, a few syntax additions
were introduced to improve code-readability (e.g. always @*, named-parameter override, C-
style function/task/module header declaration).
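A short sketch of the generate/endgenerate construct mentioned above, instantiating an array of identical one-bit adder stages (the module itself is illustrative):

module ripple_adder #(parameter N = 4)(
  input  [N-1:0] a, b,
  input          cin,
  output [N-1:0] sum,
  output         cout
);
  wire [N:0] carry;
  assign carry[0] = cin;
  genvar i;
  generate
    for (i = 0; i < N; i = i + 1) begin : stage
      // each generated stage is one bit of a ripple-carry adder
      assign sum[i]     = a[i] ^ b[i] ^ carry[i];
      assign carry[i+1] = (a[i] & b[i]) | (carry[i] & (a[i] ^ b[i]));
    end
  endgenerate
  assign cout = carry[N];
endmodule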
Verilog-2001 is the dominant flavor of Verilog supported by the majority of commercial
EDA software packages.
Verilog 2005
Not to be confused with SystemVerilog, Verilog 2005 (IEEE Standard 1364-2005)
consists of minor corrections, spec clarifications, and a few new language features (such as the
uwire keyword).
A separate part of the Verilog standard, Verilog-AMS, attempts to integrate analog and
mixed signal modelling with traditional Verilog.
SystemVerilog
SystemVerilog is a superset of Verilog-2005, with many new features and capabilities to
aid design-verification and design-modeling.
The advent of hardware verification languages such as OpenVera, and Verisity's e
language encouraged the development of Superlog by Co-Design Automation Inc. Co-Design
Automation Inc was later purchased by Synopsys. The foundations of Superlog and Vera were
donated to Accellera, which later became the IEEE standard P1800-2005: SystemVerilog.
VHDL (VHSIC Hardware Description Language):
VHDL (VHSIC hardware description language) is a hardware description language used
in electronic design automation to describe digital and mixed-signal systems such as field-
programmable gate arrays and integrated circuits.
VHDL was originally developed at the behest of the U.S. Department of Defense in order
to document the behavior of the ASICs that supplier companies were including in equipment.
That is to say, VHDL was developed as an alternative to huge, complex manuals which were
subject to implementation-specific details.
The idea of being able to simulate this documentation was so obviously attractive that
logic simulators were developed that could read the VHDL files. The next step was the
development of logic synthesis tools that read the VHDL, and output a definition of the physical
implementation of the circuit.
Because the Department of Defense required as much of the syntax as possible to be based on
Ada, in order to avoid re-inventing concepts that had already been thoroughly tested in the
development of Ada, VHDL borrows heavily from the Ada programming language in both
concepts and syntax.
The initial version of VHDL, designed to IEEE standard 1076-1987, included a wide range of
data types, including numerical (integer and real), logical (bit and boolean), character and time,
plus arrays of bit called bit_vector and of character called string.
A problem not solved by this edition, however, was "multi-valued logic", where a signal's drive
strength (none, weak or strong) and unknown values are also considered. This required IEEE
standard 1164, which defined the 9-value logic types: scalar std_ulogic and its vector version
std_ulogic_vector.

An updated version of IEEE 1076, in 1993, made the syntax more consistent, allowed more
flexibility in naming, extended the character type to allow ISO-8859-1 printable characters,
added the xnor operator, etc.
Minor changes in the standard (2000 and 2002) added the idea of protected types (similar
to the concept of class in C++) and removed some restrictions from port mapping rules. In
addition to IEEE standard 1164, several child standards were introduced to extend functionality
of the language. IEEE standard 1076.2 added better handling of real and complex data types.
IEEE standard 1076.3 introduced signed and unsigned types to facilitate arithmetical operations
on vectors. IEEE standard 1076.1 (known as VHDL-AMS) provided analog and mixed-signal
circuit design extensions.
Some other standards support wider use of VHDL, notably VITAL (VHDL Initiative
Towards ASIC Libraries) and microwave circuit design extensions. In February 2008, Accellera
approved VHDL 4.0 also informally known as VHDL 2008, which addressed more than 90
issues discovered during the trial period for version 3.0 and includes enhanced generic types. In
2008, Accellera released VHDL 4.0 to the IEEE for balloting for inclusion in IEEE 1076-2008.
The VHDL standard IEEE 1076-2008 was published in September 2008.
DESIGN
VHDL is commonly used to write text models that describe a logic circuit. Such a model
is processed by a synthesis program, only if it is part of the logic design. A simulation program is
used to test the logic design using simulation models to represent the logic circuits that interface
to the design. This collection of simulation models is commonly called a testbench.
VHDL has constructs to handle the parallelism inherent in hardware designs, but these
constructs (processes) differ in syntax from the parallel constructs in Ada (tasks). Like Ada,
VHDL is strongly typed and is not case sensitive. In order to directly represent operations which
are common in hardware, there are many features of VHDL which are not found in Ada, such as
an extended set of Boolean operators including nand and nor. VHDL also allows arrays to be
indexed in either ascending or descending order; both conventions are used in hardware,
whereas in Ada and most programming languages only ascending indexing is available.

VHDL has file input and output capabilities, and can be used as a general-purpose
language for text processing, but files are more commonly used by a simulation testbench for
stimulus or verification data. There are some VHDL compilers which build executable binaries.
In this case, it might be possible to use VHDL to write a testbench to verify the functionality of
the design using files on the host computer to define stimuli, to interact with the user, and to
compare results with those expected. However, most designers leave this job to the simulator.
It is relatively easy for an inexperienced developer to produce code that simulates
successfully but that cannot be synthesized into a real device, or is too large to be practical. One
particular pitfall is the accidental production of transparent latches rather than D-type flip-flops
as storage elements.
One can design hardware in a VHDL IDE (for FPGA implementation such as Xilinx ISE,
Altera Quartus, Synopsys Synplify or Mentor Graphics HDL Designer) to produce the RTL
schematic of the desired circuit. After that, the generated schematic can be verified using
simulation software which shows the waveforms of inputs and outputs of the circuit after
generating the appropriate testbench. To generate an appropriate testbench for a particular circuit
or VHDL code, the inputs have to be defined correctly. For example, for clock input, a loop
process or an iterative statement is required.
A final point is that when a VHDL model is translated into the "gates and wires" that are
mapped onto a programmable logic device such as a CPLD or FPGA, then it is the actual
hardware being configured, rather than the VHDL code being "executed" as if on some form of a
processor chip.
Advantages
The key advantage of VHDL when used for systems design is that it allows the behavior
of the required system to be described (modeled) and verified (simulated) before synthesis tools
translate the design into real hardware (gates and wires).

Another benefit is that VHDL allows the description of a concurrent system. VHDL is a
Dataflow language, unlike procedural computing languages such as BASIC, C, and assembly
code, which all run sequentially, one instruction at a time.
Field-programmable Gate Array (FPGA):
A Field-programmable Gate Array (FPGA) is an integrated circuit designed to be
configured by the customer or designer after manufacturing, hence "field-programmable". The
FPGA configuration is generally specified using a hardware description language (HDL), similar
to that used for an application-specific integrated circuit (ASIC) (circuit diagrams were
previously used to specify the configuration, as they were for ASICs, but this is increasingly
rare). FPGAs can be used to implement any logical function that an ASIC could perform. The
ability to update the functionality after shipping, partial re-configuration of the portion of the
design and the low non-recurring engineering costs relative to an ASIC design (notwithstanding
the generally higher unit cost), offer advantages for many applications.
FPGAs contain programmable logic components called "logic blocks", and a hierarchy of
reconfigurable interconnects that allow the blocks to be "wired together", somewhat like many
(changeable) logic gates that can be inter-wired in (many) different configurations . Logic blocks
can be configured to perform complex combinational functions, or merely simple logic gates like
AND and XOR. In most FPGAs, the logic blocks also include memory elements, which may be
simple flip-flops or more complete blocks of memory.
In addition to digital functions, some FPGAs have analog features. The most common
analog feature is programmable slew rate and drive strength on each output pin, allowing the
engineer to set slow slew rates on lightly loaded pins that would otherwise ring unacceptably, and to
set stronger, faster rates on heavily loaded pins on high-speed channels that would otherwise run
too slow. Another relatively common analog feature is differential comparators on input pins
designed to be connected to differential signaling channels. A few "mixed signal FPGAs" have
integrated peripheral Analog-to-Digital Converters (ADCs) and Digital-to-Analog Converters
(DACs) with analog signal conditioning blocks allowing them to operate as a system-on-a-chip.
Such devices blur the line between an FPGA, which carries digital ones and zeros on its internal
programmable interconnect fabric, and field-programmable analog array (FPAA), which carries
analog values on its internal programmable interconnect fabric.
History
The FPGA industry sprouted from programmable read-only memory (PROM) and
programmable logic devices (PLDs). PROMs and PLDs both had the option of being
programmed in batches in a factory or in the field (field programmable), however programmable
logic was hard-wired between logic gates [6].

In the late 1980s the Naval Surface Warfare Department funded an experiment proposed
by Steve Casselman to develop a computer that would implement 600,000 reprogrammable
gates. Casselman was successful and a patent related to the system was issued in 1992.
Some of the industry's foundational concepts and technologies for programmable logic arrays,
gates, and logic blocks are founded in patents awarded to David W. Page and LuVerne R.
Peterson in 1985 [7], [8].

Xilinx Co-Founders, Ross Freeman and Bernard Vonderschmitt, invented the first
commercially viable field-programmable gate array in 1985: the XC2064. The XC2064 had
programmable gates and programmable interconnects between gates, the beginnings of a new
technology and market. The XC2064 boasted a mere 64 configurable logic blocks (CLBs), with
two 3-input lookup tables (LUTs). More than 20 years later, Freeman was entered into the
National Inventor's Hall of Fame for his invention.
Xilinx continued unchallenged and quickly growing from 1985 to the mid-1990s, when
competitors sprouted up, eroding significant market-share. By 1993, Actel was serving about 18
percent of the market.
The 1990s were an explosive period of time for FPGAs, both in sophistication and the
volume of production. In the early 1990s, FPGAs were primarily used in telecommunications
and networking. By the end of the decade, FPGAs found their way into consumer, automotive,
and industrial applications.

FPGAs got a glimpse of fame in 1997, when Adrian Thompson, a researcher working at
the University of Sussex, merged genetic algorithm technology and FPGAs to create a sound
recognition device. Thompson's algorithm configured an array of 10 x 10 cells in a Xilinx FPGA chip to discriminate between two tones, utilising analogue features of the digital chip. The application of genetic algorithms to the configuration of devices like FPGAs is now referred to as evolvable hardware.
Modern developments
A recent trend has been to take the coarse-grained architectural approach a step further by
combining the logic blocks and interconnects of traditional FPGAs with embedded
microprocessors and related peripherals to form a complete "system on a programmable chip".
This work mirrors the architecture by Ron Perlof and Hana Potash of Burroughs Advanced
Systems Group which combined a reconfigurable CPU architecture on a single chip called the
SB24. That work was done in 1982. Examples of such hybrid technologies can be found in the
Xilinx Virtex-II PRO and Virtex-4 devices, which include one or more PowerPC processors
embedded within the FPGA's logic fabric. The Atmel FPSLIC is another such device, which uses
an AVR processor in combination with Atmel's programmable logic architecture. The Actel
SmartFusion devices incorporate an ARM architecture Cortex-M3 hard processor core (with up
to 512kB of flash and 64kB of RAM) and analog peripherals such as a multi-channel ADC and
DACs to their flash-based FPGA fabric.
An alternate approach to using hard-macro processors is to make use of soft processor
cores that are implemented within the FPGA logic.
As previously mentioned, many modern FPGAs have the ability to be reprogrammed at
"run time," and this is leading to the idea of reconfigurable computing or reconfigurable systems
CPUs that reconfigure themselves to suit the task at hand. The Mitrion Virtual Processor from
Mitrionics is an example of a reconfigurable soft processor, implemented on FPGAs. However, it
does not support dynamic reconfiguration at runtime, but instead adapts itself to a specific
program.

Additionally, new, non-FPGA architectures are beginning to emerge. Software-
configurable microprocessors such as the Stretch S5000 adopt a hybrid approach by providing an
array of processor cores and FPGA-like programmable cores on the same chip.
Gates
1987: 9,000 gates, Xilinx
1992: 600,000, Naval Surface Warfare Department
Early 2000s: Millions
Market size
1985: First commercial FPGA technology invented by Xilinx
1987: $14 million
~1993: >$385 million
2005: $1.9 billion
2010 estimates: $2.75 billion
FPGA design starts
10,000
2005: 80,000
2008: 90,000

FPGA comparisons
Historically, FPGAs have been slower, less energy efficient and generally achieved less
functionality than their fixed ASIC counterparts. A study has shown that designs implemented on
FPGAs need on average 18 times as much area, draw 7 times as much dynamic power, and are 3
times slower than the corresponding ASIC implementations.

Advantages include the ability to re-program in the field to fix bugs, and may include a
shorter time to market and lower non-recurring engineering costs. Vendors can also take a
middle road by developing their hardware on ordinary FPGAs, but manufacture their final
version so it can no longer be modified after the design has been committed.
Xilinx claims that several market and technology dynamics are changing the ASIC/FPGA
paradigm:
Integrated circuit costs are rising aggressively
ASIC complexity has lengthened development time
R&D resources and headcount are decreasing
Revenue losses for slow time-to-market are increasing
Financial constraints in a poor economy are driving low-cost technologies
These trends make FPGAs a better alternative than ASICs for a larger number of higher-
volume applications than they have been historically used for, to which the company attributes
the growing number of FPGA design starts (see History).
Some FPGAs have the capability of partial re-configuration that lets one portion of the
device be re-programmed while other portions continue running.
Versus complex programmable logic devices
The primary differences between CPLDs (Complex Programmable Logic Devices) and
FPGAs are architectural. A CPLD has a somewhat restrictive structure consisting of one or more
programmable sum-of-products logic arrays feeding a relatively small number of clocked
registers. The result of this is less flexibility, with the advantage of more predictable timing
delays and a higher logic-to-interconnect ratio. The FPGA architectures, on the other hand, are
dominated by interconnect. This makes them far more flexible (in terms of the range of designs
that are practical for implementation within them) but also far more complex to design for.
Another notable difference between CPLDs and FPGAs is the presence in most FPGAs of
higher-level embedded functions (such as adders and multipliers) and embedded memories, as
well as the ability of logic blocks to implement decoders or mathematical functions.

Security considerations
With respect to security, FPGAs have both advantages and disadvantages as compared to
ASICs or secure microprocessors. FPGAs' flexibility makes malicious modifications during
fabrication a lower risk [21]. For many FPGAs, the loaded design is exposed while it is loaded
(typically on every power-on). To address this issue, some FPGAs support bitstream encryption.
Applications
Applications of FPGAs include digital signal processing, software-defined radio,
aerospace and defense systems, ASIC prototyping, medical imaging, computer vision, speech
recognition, cryptography, bioinformatics, computer hardware emulation, radio astronomy, metal
detection and a growing range of other areas.
FPGAs originally began as competitors to CPLDs and competed in a similar space, that
of glue logic for PCBs. As their size, capabilities, and speed increased, they began to take over
larger and larger functions, to the point where some are now marketed as full systems on chips
(SoC). Particularly with the introduction of dedicated multipliers into FPGA architectures in the
late 1990s, applications which had traditionally been the sole preserve of DSPs began to
incorporate FPGAs instead.
FPGAs especially find applications in any area or algorithm that can make use of the
massive parallelism offered by their architecture. One such area is code breaking, in particular
brute-force attacks on cryptographic algorithms.
FPGAs are increasingly used in conventional high performance computing applications
where computational kernels such as FFT or Convolution are performed on the FPGA instead of
a microprocessor.
The inherent parallelism of the logic resources on an FPGA allows for considerable
computational throughput even at low clock rates in the MHz range. The flexibility of the FPGA allows for
even higher performance by trading off precision and range in the number format for an
increased number of parallel arithmetic units. This has driven a new type of processing called
reconfigurable computing, where time-intensive tasks are offloaded from software to FPGAs.
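As a rough illustration of this precision-versus-parallelism trade-off, the Verilog sketch below is an assumed example (the module name, widths and lane count are invented, and it is not part of the scalar design described in this work): it generates N independent multiply-accumulate lanes whose operand width is a parameter, so narrowing WIDTH frees resources that can be spent on a larger N.

module parallel_mac #(parameter WIDTH = 8, parameter N = 16) (
    input  wire                 clk, rst,
    input  wire [N*WIDTH-1:0]   a_flat, b_flat,    // N packed operand pairs
    output wire [N*2*WIDTH-1:0] acc_flat           // N packed accumulators
);
    genvar i;
    generate
        for (i = 0; i < N; i = i + 1) begin : mac
            wire [WIDTH-1:0]   a = a_flat[i*WIDTH +: WIDTH];
            wire [WIDTH-1:0]   b = b_flat[i*WIDTH +: WIDTH];
            reg  [2*WIDTH-1:0] acc;
            // One narrow multiply-accumulate lane; all N lanes run in parallel
            always @(posedge clk or posedge rst)
                if (rst) acc <= 0;
                else     acc <= acc + a * b;
            assign acc_flat[i*2*WIDTH +: 2*WIDTH] = acc;
        end
    endgenerate
endmodule

For example, halving WIDTH roughly halves the logic needed per lane, so N can be increased within the same device.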
The adoption of FPGAs in high performance computing is currently limited by the
complexity of FPGA design compared to conventional software and the turn-around times of
current design tools.
Traditionally, FPGAs have been reserved for specific vertical applications where the
volume of production is small. For these low-volume applications, the premium that companies
pay in hardware costs per unit for a programmable chip is more affordable than the development
resources spent on creating an ASIC for a low-volume application. Today, new cost and
performance dynamics have broadened the range of viable applications.
4.2. ARCHITECTURE
The most common FPGA architecture consists of an array of logic blocks (called
Configurable Logic Block, CLB, or Logic Array Block, LAB, depending on vendor), I/O pads,
and routing channels. Generally, all the routing channels have the same width (number of wires).
Multiple I/O pads may fit into the height of one row or the width of one column in the array.
An application circuit must be mapped into an FPGA with adequate resources. While the
number of CLBs/LABs and I/Os required is easily determined from the design, the number of
routing tracks needed may vary considerably even among designs with the same amount of logic.
For example, a crossbar switch requires much more routing than a systolic array with the same
gate count. Since unused routing tracks increase the cost (and decrease the performance) of the
part without providing any benefit, FPGA manufacturers try to provide just enough tracks so that
most designs that will fit in terms of LUTs and IOs can be routed. This is determined by
estimates such as those derived from Rent's rule or by experiments with existing designs.
In general, a logic block (CLB or LAB) consists of a few logical cells (called ALM, LE,
Slice, etc.). A typical cell consists of a 4-input lookup table (LUT), a full adder (FA) and a D-
type flip-flop, as shown below. In this figure, the LUT is split into two 3-input LUTs. In normal
mode, these are combined into a 4-input LUT through the left mux. In arithmetic mode, their
outputs are fed to the FA. The selection of mode is programmed into the middle mux. The
output can be either synchronous or asynchronous, depending on the programming of the mux to
the right in the figure example. In practice, all or part of the FA is implemented as a function inside the
LUTs in order to save space.

Simplified example illustration of a logic cell
ALMs and Slices usually contain two or four structures similar to the example figure, with some
shared signals.
CLBs/LABs typically contain a few ALMs/LEs/Slices.
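The behaviour of such a cell can be sketched in Verilog. The model below is only an illustrative assumption (it is not any vendor's actual ALM/LE/Slice, and all port names are made up): two 3-input LUTs are combined into a 4-input LUT in normal mode, their outputs feed a full adder in arithmetic mode, and a final mux selects a registered or combinational output.

module logic_cell (
    input  wire       clk,
    input  wire [3:0] in,          // four general-purpose inputs
    input  wire       cin,         // carry input (arithmetic mode)
    input  wire [7:0] lut_lo,      // configuration bits: lower 3-input LUT
    input  wire [7:0] lut_hi,      // configuration bits: upper 3-input LUT
    input  wire       mode_arith,  // 0 = normal (4-input LUT), 1 = arithmetic
    input  wire       out_sync,    // 0 = combinational output, 1 = registered
    output wire       cout,        // carry output (arithmetic mode)
    output wire       out
);
    // Two 3-input LUTs addressed by in[2:0]
    wire lo = lut_lo[in[2:0]];
    wire hi = lut_hi[in[2:0]];

    // "Left mux": in normal mode, in[3] selects between the two halves,
    // forming a 4-input LUT
    wire lut4 = in[3] ? hi : lo;

    // Full adder used in arithmetic mode
    wire sum  = lo ^ hi ^ cin;
    assign cout = (lo & hi) | (cin & (lo ^ hi));

    // "Middle mux": choose between LUT mode and arithmetic mode
    wire d = mode_arith ? sum : lut4;

    // D-type flip-flop and "right mux" for synchronous/asynchronous output
    reg q;
    always @(posedge clk) q <= d;
    assign out = out_sync ? q : d;
endmodule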
In recent years, manufacturers have started moving to 6-input LUTs in their high-
performance parts, claiming increased performance. Since clock signals (and often other high-
fanout signals) are normally routed via special-purpose dedicated routing networks in
commercial FPGAs, they are managed separately from the other signals.
For this example architecture, the locations of the FPGA logic block pins are shown below.

Logic Block Pin Locations

Each input is accessible from one side of the logic block, while the output pin can
connect to routing wires in both the channel to the right and the channel below the logic block.
Each logic block output pin can connect to any of the wiring segments in the channels adjacent
to it. Similarly, an I/O pad can connect to any one of the wiring segments in the channel adjacent
to it. For example, an I/O pad at the top of the chip can connect to any of the W wires (where W
is the channel width) in the horizontal channel immediately below it. Generally, the FPGA
routing is unsegmented. That is, each wiring segment spans only one logic block before it
terminates in a switch box. By turning on some of the programmable switches within a switch
box, longer paths can be constructed. For higher speed interconnect, some FPGA architectures
use longer routing lines that span multiple logic blocks.
Whenever a vertical and a horizontal channel intersect, there is a switch box. In this
architecture, when a wire enters a switch box, there are three programmable switches that allow
it to connect to three other wires in adjacent channel segments. The pattern, or topology, of
switches used in this architecture is the planar or domain-based switch box topology. In this
switch box topology, a wire in track number one connects only to wires in track number one in
adjacent channel segments, wires in track number 2 connect only to other wires in track number
2 and so on. The figure below illustrates the connections in a switch box.

Switch box topology

Modern FPGA families expand upon the above capabilities to include higher level
functionality fixed into the silicon. Having these common functions embedded into the silicon
reduces the area required and gives those functions increased speed compared to building them
from primitives. Examples of these include multipliers, generic DSP blocks, embedded
processors, high speed IO logic and embedded memories.
FPGAs are also widely used for systems validation including pre-silicon validation, post-
silicon validation, and firmware development. This allows chip companies to validate their
design before the chip is produced in the factory, reducing the time-to-market.
4.3. FPGA design and programming
To define the behavior of the FPGA, the user provides a hardware description language
(HDL) or a schematic design. The HDL form is more suited to work with large structures
because it's possible to just specify them numerically rather than having to draw every piece by
hand. However, schematic entry can allow for easier visualisation of a design.
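As a small example of this point (the module name and widths below are made up for illustration), the Verilog sketch specifies a 64-bit accumulator with a single parameter; capturing the same structure in a schematic would require drawing and wiring every register bit by hand.

module accumulator #(parameter WIDTH = 64) (
    input  wire             clk,
    input  wire             rst,
    input  wire [WIDTH-1:0] din,
    output reg  [WIDTH-1:0] acc
);
    // Changing WIDTH re-sizes the whole datapath; nothing has to be redrawn
    always @(posedge clk or posedge rst)
        if (rst) acc <= {WIDTH{1'b0}};
        else     acc <= acc + din;
endmodule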
Then, using an electronic design automation tool, a technology-mapped netlist is
generated. The netlist can then be fitted to the actual FPGA architecture using a process called
place-and-route, usually performed by the FPGA company's proprietary place-and-route
software. The user will validate the map, place and route results via timing analysis, simulation,
and other verification methodologies. Once the design and validation process is complete, the
binary file generated (also using the FPGA company's proprietary software) is used to
(re)configure the FPGA.
Going from schematic/HDL source files to actual configuration: the source files are fed
to a software suite from the FPGA/CPLD vendor which, through several steps, produces a configuration file.
This file is then transferred to the FPGA/CPLD via a serial interface (JTAG) or to an external
memory device such as an EEPROM.
The most common HDLs are VHDL and Verilog. In an attempt to reduce the
complexity of designing in HDLs, which have been compared to the equivalent of assembly
languages, there are moves to raise the abstraction level through the introduction of alternative
languages. National Instruments' LabVIEW graphical programming language (sometimes
referred to as "G") has an FPGA add-in module available to target and program FPGA
hardware; the LabVIEW approach drastically simplifies the FPGA programming process.
To simplify the design of complex systems in FPGAs, there exist libraries of predefined
complex functions and circuits that have been tested and optimized to speed up the design
process. These predefined circuits are commonly called IP cores, and are available from FPGA
vendors and third-party IP suppliers (rarely free, and typically released under proprietary
licenses). Other predefined circuits are available from developer communities such as
OpenCores (typically released under free and open source licenses such as the GPL, BSD or
similar license), and other sources.
In a typical design flow, an FPGA application developer will simulate the design at
multiple stages throughout the design process. Initially the RTL description in VHDL or Verilog
is simulated by creating test benches to simulate the system and observe results. Then, after the
synthesis engine has mapped the design to a netlist, the netlist is translated to a gate level
description where simulation is repeated to confirm the synthesis proceeded without errors.
Finally the design is laid out in the FPGA at which point propagation delays can be added and
the simulation run again with these values back-annotated onto the netlist.
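The first of these stages can be illustrated with a small RTL test bench. The design under test here is a hypothetical 4-bit counter (an assumed example, not part of this project); the test bench generates a clock and reset, applies stimulus, and prints the observed outputs.

`timescale 1ns / 1ps

// Hypothetical design under test: a 4-bit counter
module counter (
    input  wire       clk, rst,
    output reg  [3:0] count
);
    always @(posedge clk or posedge rst)
        if (rst) count <= 4'd0;
        else     count <= count + 4'd1;
endmodule

// RTL test bench: drives clock/reset and observes the counter output
module tb_counter;
    reg        clk = 1'b0;
    reg        rst = 1'b1;
    wire [3:0] count;

    counter dut (.clk(clk), .rst(rst), .count(count));

    always #5 clk = ~clk;                 // 10 ns clock period

    initial begin
        #12 rst = 1'b0;                   // release reset after 12 ns
        repeat (16) @(posedge clk)
            $display("time=%0t  count=%0d", $time, count);
        $finish;
    end
endmodule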
Basic process technology types
SRAM - based on static memory technology. In-system programmable and re-
programmable. Requires external boot devices. CMOS.
Antifuse - One-time programmable. CMOS.
PROM - Programmable Read-Only Memory technology. One-time programmable
because of plastic packaging.
EPROM - Erasable Programmable Read-Only Memory technology. One-time
programmable in windowless packages; devices with a window can be erased with ultraviolet (UV) light. CMOS.
EEPROM - Electrically Erasable Programmable Read-Only Memory technology. Can be
erased, even in plastic packages. Some but not all EEPROM devices can be in-system
programmed. CMOS.

Flash - Flash-erase EPROM technology. Can be erased, even in plastic packages. Some
but not all flash devices can be in-system programmed. Usually, a flash cell is smaller
than an equivalent EEPROM cell and is therefore less expensive to manufacture. CMOS.
Fuse - One-time programmable. Bipolar.
Major manufacturers
Xilinx and Altera are the current FPGA market leaders and long-time industry rivals.
Together, they control over 80 percent of the market, with Xilinx alone representing over 50
percent. Both Xilinx and Altera provide free Windows and Linux design software.
Other competitors include Lattice Semiconductor (SRAM-based with integrated
configuration flash, instant-on, low power, live reconfiguration), Actel (antifuse, flash-based,
mixed-signal), SiliconBlue Technologies (extremely low-power SRAM-based FPGAs with
optional integrated nonvolatile configuration memory), Achronix (RAM-based, 1.5 GHz fabric
speed), which will be building its chips on Intel's state-of-the-art 22 nm process, and QuickLogic
(handheld-focused CSSPs, no general-purpose FPGAs).
Network processor
A network processor is an integrated circuit which has a feature set specifically targeted
at the networking application domain. Network processors are typically software programmable
devices and would have generic characteristics similar to general purpose central processing
units that are commonly used in many different types of equipment and products.
In modern telecommunications networks, information (voice, video, data) is transferred
as packet data (termed packet switching) rather than as analog signals, as was done in older
telecommunications networks such as the public switched telephone network (PSTN) or analog
TV/radio networks. The processing of these packets has resulted in the creation of integrated
circuits (ICs) that are optimised to deal with this form of packet data. Network processors have
specific features or architectures that are provided to enhance and optimise packet processing
within these networks.

Network processors have evolved into ICs with specific functions. This evolution has
resulted in more complex and more flexible ICs being created. The newer circuits are
programmable and thus allow a single hardware IC design to undertake a number of different
functions, depending on which software is installed.
Network processors are used in the manufacture of many different types of network equipment
such as:
1. Routers, software routers and switches
2. Firewalls
3. Session Border Controllers
4. Intrusion detection devices
5. Intrusion prevention devices
6. Network monitoring systems
Generic functions
In the generic role as a packet processor, a number of optimised features or functions are
typically present in a network processor; these include:
1. Pattern matching - the ability to find specific patterns of bits or bytes within packets in a
packet stream.
2. Key lookup, for example address lookup - the ability to quickly undertake a database
lookup using a key (typically an address on a packet) to find a result, typically routing
information (a small sketch of this function follows the list).
3. Computation
4. Data bitfield manipulation - the ability to change certain data fields contained in the
packet as it is being processed.
5. Queue management - as packets are received, processed and scheduled to be sent
onwards, they are stored in queues.
6. Control processing - the micro operations of processing a packet are controlled at a macro
level which involves communication and orchestration with other nodes in a system.

7. Quick allocation and re-circulation of packet buffers.
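To make the key-lookup function concrete, the sketch below models a toy four-entry exact-match lookup table in Verilog. The addresses and port numbers are invented for illustration; a real network processor would use far larger hash- or TCAM-based structures.

module addr_lookup (
    input  wire [31:0] key,    // e.g. destination address extracted from a packet
    output reg         hit,    // 1 when the key matches a table entry
    output reg  [1:0]  port    // routing result: output port number
);
    // Hypothetical table entries
    localparam [31:0] A0 = 32'hC0A8_0001,
                      A1 = 32'hC0A8_0002,
                      A2 = 32'h0A00_0001,
                      A3 = 32'h0A00_0002;

    always @(*) begin
        hit = 1'b1;
        case (key)
            A0:      port = 2'd0;
            A1:      port = 2'd1;
            A2:      port = 2'd2;
            A3:      port = 2'd3;
            default: begin hit = 1'b0; port = 2'd0; end
        endcase
    end
endmodule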
Architectural paradigms
In order to deal with high data-rates, several architectural paradigms have been commonly used:
1. Pipeline of processors - each stage of the pipeline consisting of an entire processor
performing one of the functions listed above.
2. Parallel processing with multiple processors, often including multithreading.
3. Specialized microcoded engines to more efficiently accomplish the tasks at hand.
4. More recently, multicore architectures used for higher-layer (L4-L7) application
processing.
Additionally, traffic management, which is a critical element in L2-L3 network processing
and used to be executed by a variety of co-processors, has become an integral part of the network
processor architecture, and a substantial part of its silicon area ("real estate") is devoted to the
integrated traffic manager.
Applications
Using the generic function of the network processor, a software program implements an
application that the network processor executes, resulting in the piece of physical equipment
performing a task or providing a service. Some of the application types typically implemented
as software running on network processors are:
1. Packet or frame discrimination and forwarding, that is, the basic operation of a router or
switch.
2. Quality of service (QoS) enforcement - identifying different types or classes of packets
and providing preferential treatment for some types or classes of packet at the expense of
other types or classes of packet.
3. Access Control functions - determining whether a specific packet or stream of packets
should be allowed to traverse the piece of network equipment.

4. Encryption of data streams - built-in hardware-based encryption engines allow individual
data flows to be encrypted by the processor.
5. TCP offload processing
APPLICATION-SPECIFIC INTEGRATED CIRCUIT
An application-specific integrated circuit (ASIC) (pronounced /ˈeɪsɪk/) is an integrated
circuit (IC) customized for a particular use, rather than intended for general-purpose use. For
example, a chip designed solely to run a cell phone is an ASIC. Application specific standard
products (ASSPs) are intermediate between ASICs and industry standard integrated circuits like
the 7400 or the 4000 series.
As feature sizes have shrunk and design tools improved over the years, the maximum
complexity (and hence functionality) possible in an ASIC has grown from 5,000 gates to over
100 million. Modern ASICs often include entire 32-bit processors, memory blocks including
ROM, RAM, EEPROM, Flash and other large building blocks. Such an ASIC is often termed a
SoC (system-on-a-chip). Designers of digital ASICs use a hardware description language (HDL),
such as Verilog or VHDL, to describe the functionality of ASICs.
Field-programmable gate arrays (FPGA) are the modern-day technology for building a
breadboard or prototype from standard parts; programmable logic blocks and programmable
interconnects allow the same FPGA to be used in many different applications. For smaller
designs and/or lower production volumes, FPGAs may be more cost effective than an ASIC
design even in production. The non-recurring engineering cost of an ASIC can run into the
millions of dollars.
History
The initial ASICs used gate array technology. Ferranti produced perhaps the first gate-
array, the ULA (Uncommitted Logic Array), around 1980. An early successful commercial
application was the ULA circuitry found in the 8-bit ZX81 and ZX Spectrum low-end personal
computers, introduced in 1981 and 1982. These were used by Sinclair Research (UK) essentially
as a low-cost I/O solution aimed at handling the computer's graphics. Some versions of
ZX81/Timex Sinclair 1000 used just four chips (ULA, 2Kx8 RAM, 8Kx8 ROM, Z80A CPU) to
implement an entire mass-market personal computer with built-in BASIC interpreter.
Customization occurred by varying the metal interconnect mask. ULAs had complexities
of up to a few thousand gates. Later versions became more generalized, with different base dies
customised by both metal and polysilicon layers. Some base dies include RAM elements.
Standard cell design
In the mid 1980s a designer would choose an ASIC manufacturer and implement their
design using the design tools available from the manufacturer. While third party design tools
were available, there was not an effective link from the third party design tools to the layout and
actual semiconductor process performance characteristics of the various ASIC manufacturers.
Most designers ended up using factory specific tools to complete the implementation of their
designs. A solution to this problem that also yielded a much higher density device was the
implementation of Standard Cells. Every ASIC manufacturer could create functional blocks with
known electrical characteristics, such as propagation delay, capacitance and inductance, that
could also be represented in third party tools. Standard Cell design is the utilization of these
functional blocks to achieve very high gate density and good electrical performance. Standard
cell design fits between Gate Array and Full Custom design in terms of both its NRE (Non-
Recurring Engineering) and recurring component cost.
By the late 1990s, logic synthesis tools became available. Such tools could compile HDL
descriptions into a gate-level netlist. This enabled a style of design called standard-cell design.
Standard-cell Integrated Circuits (ICs) are designed in the following conceptual stages, although
these stages overlap significantly in practice.
These steps, implemented with a level of skill common in the industry, almost always produce a
final device that correctly implements the original design, unless flaws are later introduced by
the physical fabrication process.
1. A team of design engineers starts with a non-formal understanding of the required
functions for a new ASIC, usually derived from Requirements analysis.

2. The design team constructs a description of an ASIC to achieve these goals using an
HDL. This process is analogous to writing a computer program in a high-level language.
This is usually called the RTL (Register transfer level) design.
3. Suitability for purpose is verified by functional verification. This may include such
techniques as logic simulation, formal verification, emulation, or creating an equivalent
pure software model (see Simics, for example). Each technique has advantages and
disadvantages, and often several methods are used.
4. Logic synthesis transforms the RTL design into a large collection of lower-level
constructs called standard cells. These constructs are taken from a standard-cell library
consisting of pre-characterized collections of gates (such as 2-input NOR, 2-input NAND,
inverters, etc.). The standard cells are typically specific to the planned manufacturer of
the ASIC. The resulting collection of standard cells, plus the needed electrical
connections between them, is called a gate-level netlist (a small hand-made illustration follows this list).
5. The gate-level netlist is next processed by a placement tool which places the standard
cells onto a region representing the final ASIC. It attempts to find a placement of the
standard cells, subject to a variety of specified constraints.
6. The routing tool takes the physical placement of the standard cells and uses the netlist to
create the electrical connections between them. Since the search space is large, this
process will produce a sufficient rather than globally-optimal solution. The output is
a file which can be used to create a set of photomasks enabling a semiconductor
fabrication facility (commonly called a 'fab') to produce physical ICs.
7. Given the final layout, circuit extraction computes the parasitic resistances and
capacitances. In the case of a digital circuit, this will then be further mapped into delay
information, from which the circuit performance can be estimated, usually by static
timing analysis. This, and other final tests such as design rule checking and power
analysis (collectively called signoff) are intended to ensure that the device will function
correctly over all extremes of the process, voltage and temperature. When this testing is
complete the photomask information is released for chip fabrication.
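As a hand-made illustration of step 4 (not the output of any real synthesis tool), the Verilog below shows a 2:1 multiplexer first as RTL and then as a gate-level netlist of hypothetical standard cells; in practice the cell names and drive strengths would come from the manufacturer's library.

// RTL description of a 2:1 multiplexer
module mux2_rtl (input wire a, b, sel, output wire y);
    assign y = sel ? b : a;
endmodule

// Hypothetical standard cells standing in for a manufacturer's library
module INV_X1   (input wire A,      output wire ZN); assign ZN = ~A;         endmodule
module NAND2_X1 (input wire A1, A2, output wire ZN); assign ZN = ~(A1 & A2); endmodule

// The same multiplexer as a gate-level netlist of those standard cells
module mux2_gates (input wire a, b, sel, output wire y);
    wire sel_n, n1, n2;
    INV_X1   u0 (.A(sel),  .ZN(sel_n));
    NAND2_X1 u1 (.A1(a),   .A2(sel_n), .ZN(n1));   // a & ~sel (inverted)
    NAND2_X1 u2 (.A1(b),   .A2(sel),   .ZN(n2));   // b &  sel (inverted)
    NAND2_X1 u3 (.A1(n1),  .A2(n2),    .ZN(y));    // OR of the two product terms
endmodule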
These design steps (or flow) are also common to standard product design. The significant
difference is that Standard Cell design uses the manufacturer's cell libraries that have been used
in potentially hundreds of other design implementations and therefore are of much lower risk
than full custom design. Standard Cells produce a design density that is cost effective, and they
can also integrate IP cores and SRAM (Static Random Access Memory) effectively, unlike Gate
Arrays.
Gate array design
Gate array design is a manufacturing method in which the diffused layers, i.e. transistors
and other active devices, are predefined and wafers containing such devices are held in stock
prior to metallization, in other words, unconnected. The physical design process then defines the
interconnections of the final device. For most ASIC manufacturers, this consists of from two to
as many as nine metal layers, each metal layer running perpendicular to the one below it. Non-
recurring engineering costs are much lower as photo-lithographic masks are required only for the
metal layers, and production cycles are much shorter as metallization is a comparatively quick
process.
Gate array ASICs are always a compromise as mapping a given design onto what a
manufacturer held as a stock wafer never gives 100% utilization. Often difficulties in routing the
interconnect require migration onto a larger array device with consequent increase in the piece
part price. These difficulties are often a result of the layout software used to develop the
interconnect.
Pure, logic-only gate array design is rarely implemented by circuit designers today,
replaced almost entirely by field-programmable devices, such as field-programmable gate arrays
(FPGAs), which can be programmed by the user and thus offer minimal tooling charges (non-
recurring engineering (NRE)), marginally increased piece part cost and comparable performance.
Today gate arrays are evolving into structured ASICs that consist of a large IP core like a CPU,
DSP unit, peripherals, standard interfaces, integrated memories (SRAM), and a block of
reconfigurable, uncommitted logic. This shift is largely because ASIC devices are capable of
integrating such large blocks of system functionality, and a "system on a chip" requires far more
than just logic blocks.

In their frequent usages in the field, the terms "gate array" and "semi-custom" are
synonymous. Process engineers more commonly use the term "semi-custom" while "gate-array"
is more commonly used by logic (or gate-level) designers.
Full-custom design
By contrast, full-custom ASIC design defines all the photo lithographic layers of the
device. Full-custom design is used for both ASIC design and for standard product design.
The benefits of full-custom design usually include reduced area (and therefore recurring
component cost), performance improvements, and also the ability to integrate analog
components and other pre-designed (and thus fully verified) components such as microprocessor
cores that form a system-on-chip.
The disadvantages of full-custom design can include increased manufacturing and design
time, increased non-recurring engineering costs, more complexity in the computer-aided design
(CAD) system and a much higher skill requirement on the part of the design team. However for
digital-only designs, "standard-cell" cell libraries together with modern CAD systems can offer
considerable performance/cost benefits with low risk. Automated layout tools are quick and easy
to use and also offer the possibility to "hand-tweak" or manually optimise any performance-
limiting aspect of the design.
Structured design
Structured ASIC design (also referred to as "platform ASIC design") has different
meanings in different contexts. This is a relatively new term in the industry, which is why there
is some variation in its definition. However, the basic premise of a structured ASIC is that both
manufacturing cycle time and design cycle time are reduced compared to cell-based ASIC by
virtue of there being pre-defined metal layers (thus reducing manufacturing time) and pre-
characterization of what is on the silicon (thus reducing design cycle time). One definition states
that in a "structured ASIC" design, the logic mask-layers of a device are predefined by the ASIC
vendor (or in some cases by a third party). Design differentiation and customization is achieved
by creating custom metal layers that create custom connections between predefined lower-layer
logic elements. "Structured ASIC" technology is seen as bridging the gap between field-
programmable gate arrays and "standard-cell" ASIC designs. Because only a small number of
chip layers must be custom-produced, "structured ASIC" designs have much smaller non-
recurring expenditures (NRE) than "standard-cell" or "full-custom" chips, which require that a
full mask set be produced for every design.
This is effectively the same definition as a gate array.
What makes a structured ASIC different from a gate array is that in a gate array the
predefined metal layers serve to make manufacturing turnaround faster. In a structured ASIC the
predefined metallization is primarily to reduce cost of the mask sets and is also used to make the
design cycle time significantly shorter as well. For example, in a cell-based or gate-array design
the user often must design power, clock, and test structures themselves; these are predefined in
most structured ASICs and therefore can save time and expense for the designer compared to
gate-array. Likewise, the design tools used for structured ASIC can be substantially lower cost
and easier (faster) to use than cell-based tools, because the tools do not have to perform all the
functions that cell-based tools do. In some cases, the structured ASIC vendor requires that
customized tools for their device (for example, custom physical synthesis) be used, also allowing
for the design to be brought into manufacturing more quickly. ChipX, eASIC, and Triad
Semiconductor are examples of vendors offering this kind of structured ASIC.
One other important aspect about structured ASIC is that it allows IP that is common to
certain applications or industry segments to be "built in", rather than "designed in". By building
the IP directly into the architecture the designer can again save both time and money compared
to designing IP into a cell-based ASIC.
Altera's technique of producing a structured-cell ASIC, in which the cells are the same
design as in the FPGA but the programmable routing is replaced with fixed wire interconnect, is
called HardCopy. These devices do not need programming and cannot be re-programmed
the way an FPGA can.
The Xilinx technique of producing a customer-specific FPGA that is 30%-70% less
expensive than a standard FPGA, in which the cells are the same as in the FPGA but the
programmable capability is removed, is called EasyPath.
Cell libraries, IP-based design, hard and soft macros
Cell libraries of logical primitives are usually provided by the device manufacturer as
part of the service. Although they will incur no additional cost, their release will be covered by
the terms of a non-disclosure agreement (NDA) and they will be regarded as intellectual property
by the manufacturer. Usually their physical design will be pre-defined so they could be termed
"hard macros".
What most engineers understand as "intellectual property" are IP cores, designs
purchased from a third party as sub-components of a larger ASIC. They may be provided as an
HDL description (often termed a "soft macro"), or as a fully routed design that could be printed
directly onto an ASIC's mask (often termed a hard macro). Many organizations now sell such
pre-designed cores, and larger organizations may have an entire department or division to
produce cores for the rest of the organization. For example, one can purchase CPUs, ethernet,
USB or telephone interfaces. Indeed, the wide range of functions now available is a significant
factor in the phenomenal increase in electronics in the late 1990s and early 2000s; as a core takes
a lot of time and investment to create, its re-use and further development cuts product cycle
times dramatically and creates better products. Additionally, organizations such as OpenCores
are collecting free IP cores paralleling the OpenSource movement in hardware.
Soft macros are often process-independent, i.e., they can be fabricated on a wide range of
manufacturing processes and different manufacturers.
Hard macros are process-limited and usually further design effort must be invested to
migrate (port) to a different process or manufacturer.
Multi-project wafers
Some manufacturers offer multi-project wafers (MPW) as a method of obtaining low-
cost prototypes. Often called shuttles, these MPWs, containing several designs, run at regular,
scheduled intervals on a "cut and go" basis, usually with very little liability on the part of the
manufacturer. The contract involves the assembly and packaging of a handful of devices. The
service usually involves the supply of a physical design database, i.e., masking information or a
Pattern Generation (PG) tape. The manufacturer is often referred to as a "silicon foundry" due to
the low involvement it has in the process. See also Multi Project Chip.
ASIC suppliers
There are two different types of ASIC suppliers: IDM and fabless. An IDM supplier's
ASIC product is based in large part on proprietary technology such as design tools, IP,
packaging, and usually, although not necessarily, the process technology. Fabless ASIC suppliers
rely almost exclusively on outside suppliers for their technology. The classification can be
confusing since several IDMs are also fabless semiconductor companies.























5. CONCLUSION
In this brief, a low-cost, low-memory-requirement, and high-performance VLSI
architecture for an image scaling processor has been proposed. Filter combining, hardware
sharing, and reconfigurable techniques have been used to reduce the hardware cost.
Compared with previous low-complexity VLSI image scalar designs, this work achieves at
least a 34.5% reduction in gate count and requires only a one-line memory buffer.























6. Coding
//////////////////////////////////////////////////////////////////////////////////
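// Module "sv": two 8-entry x 4-bit buffers (named "stacks" here) with
// independent write/read controls and full/empty flags. Data read from
// stack 1 is tagged with MSB = 1 and data from stack 2 with MSB = 0; a
// clocked arbiter forwards the tagged stack-1 word with priority on the
// 5-bit output o.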
module sv (
    q1, q2, o,
    datain1, dataout1, stackempty1, stackfull1, writetostack1, readfromstack1,
    clk, rst,
    datain2, dataout2, stackempty2, stackfull2, writetostack2, readfromstack2
);
// Shared output: 5-bit result selected by the arbiter at the end of the module
output [4:0] o;
reg    [4:0] o;

// Stack 1: parameters, ports, pointers and status flags
parameter stackwidth1    = 4;
parameter stackheight1   = 8;
parameter stackptrwidth1 = 3;
output [stackwidth1-1:0] dataout1;
output stackempty1, stackfull1;
input  [stackwidth1-1:0] datain1;
input  clk, rst;
input  writetostack1, readfromstack1;
reg [stackptrwidth1-1:0] readptr1, writeptr1;
reg [stackwidth1-1:0]    dataout1;
reg [stackptrwidth1:0]   ptrdiff1;               // number of words stored in stack 1
output [4:0] q1, q2;
reg    [4:0] q1, q2;

reg [stackwidth1-1:0] stack1 [stackheight1-1:0]; // stack 1 storage array
assign stackempty1 = (ptrdiff1 == 0)            ? 1'b1 : 1'b0;
assign stackfull1  = (ptrdiff1 == stackheight1) ? 1'b1 : 1'b0;




// Stack 2: parameters, ports, pointers and status flags
parameter stackwidth2    = 4;
parameter stackheight2   = 8;
parameter stackptrwidth2 = 3;
output [stackwidth2-1:0] dataout2;
output stackempty2, stackfull2;
input  [stackwidth2-1:0] datain2;
input  writetostack2, readfromstack2;
reg [stackptrwidth2-1:0] readptr2, writeptr2;
reg [stackwidth2-1:0]    dataout2;
reg [stackptrwidth2:0]   ptrdiff2;               // number of words stored in stack 2

reg [stackwidth2-1:0] stack2 [stackheight2-1:0]; // stack 2 storage array
assign stackempty2 = (ptrdiff2 == 0)            ? 1'b1 : 1'b0;
assign stackfull2  = (ptrdiff2 == stackheight2) ? 1'b1 : 1'b0;



// Stack 1: synchronous read/write. A read (pop) takes priority over a write
// (push) in the same cycle; ptrdiff1 tracks the number of stored words.
always @(posedge clk or posedge rst)
begin
    if (rst) begin
        dataout1 <= 0;
        readptr1 <= 0;
        writeptr1 <= 0;
        ptrdiff1 <= 0;
    end
    else begin
        if ((readfromstack1) && (!stackempty1)) begin
            dataout1 <= stack1[readptr1];
            readptr1 <= readptr1 + 1;
            ptrdiff1 <= ptrdiff1 - 1;
        end
        else if ((writetostack1) && (!stackfull1)) begin
            stack1[writeptr1] <= datain1;
            writeptr1 <= writeptr1 + 1;
            ptrdiff1 <= ptrdiff1 + 1;
        end
    end
    q1 <= {1'b1, dataout1[3:0]};   // tag stack-1 data with MSB = 1
end

// Stack 2: same behaviour as stack 1; its data is tagged with MSB = 0.
always @(posedge clk or posedge rst)
begin
    if (rst) begin
        dataout2 <= 0;
        readptr2 <= 0;
        writeptr2 <= 0;
        ptrdiff2 <= 0;
    end
    else begin
        if ((readfromstack2) && (!stackempty2)) begin
            dataout2 <= stack2[readptr2];
            readptr2 <= readptr2 + 1;
            ptrdiff2 <= ptrdiff2 - 1;
        end
        else if ((writetostack2) && (!stackfull2)) begin
            stack2[writeptr2] <= datain2;
            writeptr2 <= writeptr2 + 1;
            ptrdiff2 <= ptrdiff2 + 1;
        end
    end
    q2 <= {1'b0, dataout2[3:0]};   // tag stack-2 data with MSB = 0
end


// Output arbiter: on reset, o is cleared; otherwise the word tagged with
// MSB = 1 (stack 1) is forwarded, else the stack-2 word.
always @(posedge clk or posedge rst)
begin
    if (rst)
        o <= 5'b0;
    else if (q1[4] == 1'b1)
        o <= q1;
    else
        o <= q2;
end

endmodule






SCREEN SHOTS
Input

Output



RTL schematic:
















7. REFERENCES

[1] K. Jensen and D. Anastassiou, "Subpixel edge localization and the
interpolation of still images," IEEE Trans. Image Process., vol. 4, no. 3, pp. 285-
295, Mar. 1995.

[2] H. Kim, Y. Cha, and S. Kim, "Curvature interpolation method for image
zooming," IEEE Trans. Image Process., vol. 20, no. 7, pp. 1895-1903, Jul. 2011.

[3] J. W. Han, J. H. Kim, S. H. Cheon, J. O. Kim, and S. J. Ko, "A novel image
interpolation method using the bilateral filter," IEEE Trans. Consum. Electron.,
vol. 56, no. 1, pp. 175-181, Feb. 2010.

[4] X. Zhang and X. Wu, "Image interpolation by adaptive 2-D autoregressive
modeling and soft-decision estimation," IEEE Trans. Image Process., vol. 17, no.
6, pp. 887-896, Jun. 2008.

[5] F. Cardells-Tormo and J. Arnabat-Benedicto, "Flexible hardware-friendly
digital architecture for 2-D separable convolution-based scaling," IEEE Trans.
Circuits Syst. II, Exp. Briefs, vol. 53, no. 7, pp. 522-526, Jul. 2006.

[6] S. Ridella, S. Rovetta, and R. Zunino, "IAVQ-interval-arithmetic vector
quantization for image compression," IEEE Trans. Circuits Syst. II, Analog Digit.
Signal Process., vol. 47, no. 12, pp. 1378-1390, Dec. 2000.

[7] S. Saponara, L. Fanucci, S. Marsi, G. Ramponi, D. Kammler, and E. M. Witte,
"Application-specific instruction-set processor for Retinex-like image and video
processing," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 7, pp. 596-600,
Jul. 2007.

[8] P. Y. Chen, C. C. Huang, Y. H. Shiau, and Y. T. Chen, "A VLSI
implementation of barrel distortion correction for wide-angle camera images,"
IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 1, pp. 51-55, Jan. 2009.

[9] M. Fons, F. Fons, and E. Canto, "Fingerprint image processing acceleration
through run-time reconfigurable hardware," IEEE Trans. Circuits Syst. II, Exp.
Briefs, vol. 57, no. 12, pp. 991-995, Dec. 2010.

[10] C. H. Kim, S. M. Seong, J. A. Lee, and L. S. Kim, "Winscale: An image-
scaling algorithm using an area pixel model," IEEE Trans. Circuits Syst. Video
Technol., vol. 13, no. 6, pp. 549-553, Jun. 2003.

[11] C. C. Lin, Z. C. Wu, W. K. Tsai, M. H. Sheu, and H. K. Chiang, "The VLSI
design of winscale for digital image scaling," in Proc. IEEE Int. Conf. Intell. Inf.
Hiding Multimedia Signal Process., Nov. 2007, pp. 511-514.

[12] P. Y. Chen, C. Y. Lien, and C. P. Lu, "VLSI implementation of an edge-
oriented image scaling processor," IEEE Trans. Very Large Scale Integr. (VLSI)
Syst., vol. 17, no. 9, pp. 1275-1284, Sep. 2009.
