
CHAPTER 4

EMBEDDED EDGE DETECTION HARDWARE ACCELERATOR IMPLEMENTATION

4.1 Introduction

The embedded edge detection hardware accelerator is implemented on an Altera Stratix III FPGA DSP development board. The presented design processes 720x480 frames in real time at 30 frames per second (fps), which is an improvement on previous works [17, 18]. The design is coded in VHDL and synthesized with the Altera Quartus II software before being downloaded to the Altera Stratix III FPGA DSP development board. This chapter describes the design process. It begins with the datapath architecture of the hardware accelerator, which is the platform that accommodates the functions of the hardware accelerator unit. The next design is the control unit, which controls the operation of the hardware accelerator and its communication with other peripherals at the design integration stage. Design integration is the process of composing the complete hardware design of the embedded edge detection hardware accelerator. It requires the computer-aided design tools that come with the FPGA development board being used [15, 16]. The last step in the design process is software programming, which ensures that the edge detection application runs on top of the hardware design.
4.2 Datapath Architecture

The datapath architecture must accommodate the design specification mentioned earlier. Since this design uses the Sobel edge detection algorithm, the hardware accelerator must implement that algorithm, including its real-time requirement. The Sobel edge detection algorithm computes derivative pixels from the original image pixels. The calculation begins by reading the original image from memory and loading it into a transit register. The next step is to compute the derivative image. Finally, the derivative result is written back to memory.
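The derivative computation described above can be sketched in software as follows. This is a minimal illustration, not the VHDL design: the kernels are the standard 3x3 Sobel operators, while combining them as |Gx| + |Gy|, clamping to 8 bits and leaving border pixels at 0 are assumptions, since the chapter does not state these details. The function names are hypothetical.

```python
# Software sketch of the Sobel derivative for a 3x3 neighbourhood.
# Assumptions (not stated in the chapter): the magnitude is approximated
# by |Gx| + |Gy|, the result is clamped to 8 bits, borders are left at 0.

SOBEL_X = ((-1, 0, 1),
           (-2, 0, 2),
           (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1),
           (0, 0, 0),
           (1, 2, 1))


def sobel_pixel(window):
    """Return the derivative value for the centre pixel of a 3x3 window."""
    gx = sum(SOBEL_X[r][c] * window[r][c] for r in range(3) for c in range(3))
    gy = sum(SOBEL_Y[r][c] * window[r][c] for r in range(3) for c in range(3))
    return min(abs(gx) + abs(gy), 255)


def sobel_image(image):
    """Apply sobel_pixel to every interior pixel of a 2-D list of pixels."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            window = [row[x - 1:x + 2] for row in image[y - 1:y + 2]]
            out[y][x] = sobel_pixel(window)
    return out
```

A flat region gives a derivative of 0, while a vertical step edge gives a large value at the pixels next to the step.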
To run the algorithm at 30 frames per second, as required for real-time processing, the hardware accelerator datapath is implemented with memory bandwidth reduction to deal with the constraints on computation capacity and memory bandwidth. The memory bandwidth reduction is achieved by parallelism in reading the original image pixel data from memory and by pipelining the derivative computation, as described in [8]. This method achieves a 75% memory bandwidth reduction compared to the preliminary architecture [5]. The original and derivative images are stored in a 32-bit-wide memory, and each item of image data has its own address. The pixel frames are stored as 1 byte per pixel. A row of pixels in a frame is stored from left to right at the respective addresses, and the rows are stored from top to bottom, row by row. A read or write operation on the memory takes 18.5 ns, which corresponds to two cycles of the 110 MHz clock. As discussed earlier, 720x480x30 pixels arrive from the camera at approximately 14 million pixels per second, or one pixel per 72 ns. This means that writing or reading a single pixel per memory access takes approximately 20% of the memory bandwidth. Aggregating four pixels and writing them to memory in a single access uses only 5% of the memory bandwidth, instead of the 20% needed to write each pixel individually.
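The storage scheme above (1 byte per pixel, four pixels per 32-bit word, rows stored left to right and top to bottom) can be illustrated with a short sketch. The byte order inside a word is an assumption here (leftmost pixel in the most significant byte), and the function names are hypothetical; the chapter specifies neither.

```python
PIXELS_PER_WORD = 4                              # four 8-bit pixels per 32-bit word
ROW_PIXELS = 720                                 # pixels per image row
WORDS_PER_ROW = ROW_PIXELS // PIXELS_PER_WORD    # 180 words per row


def pack_pixels(p0, p1, p2, p3):
    """Pack four 8-bit pixels into one 32-bit word.

    Assumption: the leftmost pixel (p0) goes into the most significant byte.
    """
    word = 0
    for p in (p0, p1, p2, p3):
        if not 0 <= p <= 255:
            raise ValueError("pixel values must fit in 8 bits")
        word = (word << 8) | p
    return word


def unpack_pixels(word):
    """Split a 32-bit word back into its four 8-bit pixels, leftmost first."""
    return tuple((word >> shift) & 0xFF for shift in (24, 16, 8, 0))


def word_offset(row, col):
    """Word offset of pixel (row, col) from the frame base address (row-major)."""
    return row * WORDS_PER_ROW + col // PIXELS_PER_WORD
```

With this layout a 720x480 frame occupies 480 x 180 = 86400 words, and one memory access moves four horizontally adjacent pixels.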
Another process with the potential to reduce the memory bandwidth is the derivative computation. This process relates the centre pixel to the other pixels covered by the 3x3 Sobel gradient operator kernels. Computing the derivative for a centre pixel requires the eight neighbouring pixel values from the original image. Done naively, with one memory access per neighbour, this needs 8 x 20% = 160% of the memory bandwidth, which is infeasible. To reduce the memory bandwidth, the computation uses a pipeline together with parallelism in reading the original image pixels from memory. The original image is read in groups of 4 pixels, and three such groups are read in parallel: one group plus the 2 adjacent groups of 4 pixels. With this method the accelerator needs 3 reads for every four pixels computed, and the left-hand pixels are kept inside the accelerator so that they can be reused to compute multiple derivative pixels. The memory bandwidth for reading is therefore reduced to 15%. Once the derivative computation is complete, the result is written to memory, which uses another 5% of the bandwidth. Together with the 5% used to store the incoming pixels, the total memory bandwidth required for the whole process is 25%. The Sobel edge detection hardware accelerator datapath is shown in Figure 3.1.
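The bandwidth figures quoted above can be checked with a few lines of arithmetic. The sketch below takes the chapter's 20% cost of one memory access per pixel time as given and only reproduces the bookkeeping; the function names are hypothetical.

```python
ACCESS_COST = 20        # % of memory bandwidth for one access per pixel time
PIXELS_PER_WORD = 4     # pixels moved by a single 32-bit access


def naive_read_budget():
    """Eight neighbour reads per derivative pixel, one access each."""
    return 8 * ACCESS_COST


def aggregated_budget():
    """Per-pixel cost when every access moves four pixels at once."""
    per_word = ACCESS_COST / PIXELS_PER_WORD   # cost of one access per 4 pixels
    camera_write = per_word                    # storing the incoming pixels
    window_reads = 3 * per_word                # one word from each of 3 rows
    result_write = per_word                    # storing the derivative pixels
    return camera_write, window_reads, result_write
```

Here `naive_read_budget()` evaluates to 160 and the three components of `aggregated_budget()` are 5, 15 and 5, which sum to the 25% stated in the text.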

The datapath design shows the interaction between the accelerator block and the memory. These 2 blocks are assisted by other blocks that make the memory bandwidth reduction possible. The datapath has 5 inputs and 2 outputs and is composed of 8 sub-design modules. The first module is the clock divider, which converts the 110 MHz clock input into a 54 MHz clock with an 18.5 ns pulse width. This pulse is needed to support the operation of the accelerator control block and the control address block. The second block is the accelerator control. This block performs the Sobel edge detection computation and also controls the flow of image data to and from the memory unit. Its output drives the inputs of the following blocks: address counter, memory access controller and acknowledge generator. The next block is the address counter, which assigns the address locations of the original image data and the derivative image data. Each specific address is obtained by incrementing the base address of the respective image data. The memory access controller controls the change of data format when memory is accessed, from parallel to serial or vice versa. It also directs the memory access register block to receive data from memory or to transmit data to memory. The memory access register also supports memory access: data is held temporarily in this block before being sent to memory or after being received from it. Once a packet of data has been completely received or sent, this block triggers the acknowledge generator to produce an acknowledge signal. Another block is the memory unit, where the data pixel, the derivative address and the 1st, 2nd and 3rd row rawpix addresses are stored. The address decoder block determines the address to be accessed in the memory unit. The last block is the acknowledge generator, which produces an acknowledge signal to indicate that the data has been completely read or written. The 8 sub-design modules are described in detail below:
4.2.1 Clock Divider
The design of the clock divider module is illustrated in Figure 4.1.

[Block diagram: Clock_Div with inputs Clk_in and Rst_in and output Clk_en]
Figure 4.1 Clock divider

As shown in the block diagram above, the input ports are Clk_in and Rst_in and the output port is Clk_en. Clk_in is a 110 MHz clock source with a 50% duty cycle. The expected output clock rate is about 54 MHz, which is half of the input clock rate: the input clock is divided by a factor of 2 to obtain the output clock. The duty cycle of the output clock is also 50%.
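The divide-by-2 behaviour can be modelled in a few lines. This is a behavioural sketch only, not the VHDL module: it assumes that Clk_en toggles on each rising edge of Clk_in and that Rst_in is active high, neither of which is stated in the text.

```python
class ClockDivider:
    """Behavioural model of a divide-by-2 clock divider.

    Assumptions: Clk_en toggles on every rising edge of Clk_in, and
    Rst_in is active high and forces Clk_en to 0.
    """

    def __init__(self):
        self.clk_en = 0
        self._prev_clk_in = 0

    def step(self, clk_in, rst_in=0):
        """Advance the model by one sample of Clk_in and return Clk_en."""
        if rst_in:
            self.clk_en = 0
        elif clk_in == 1 and self._prev_clk_in == 0:   # rising edge
            self.clk_en ^= 1
        self._prev_clk_in = clk_in
        return self.clk_en


def simulate(samples):
    """Feed a sequence of Clk_in samples and collect the Clk_en samples."""
    divider = ClockDivider()
    return [divider.step(s) for s in samples]
```

Feeding the square wave 0, 1, 0, 1, ... produces an output that stays high for two input samples and low for two, i.e. half the input rate with a 50% duty cycle.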

4.2.2 Accelerator Control


The accelerator control is the key module that controls the function of the hardware accelerator. It has 3 input ports and 1 output port, as shown below:

[Block diagram: Accelerator Control with inputs Clk_en, Start and Rst_in and output State]
Figure 4.2 Accelerator control

The State output port drives the following block modules: address counter, memory unit, address decoder and acknowledge generator.
4.2.3 Address Counter
The address counter is required to prevent overlapping in data storage. The block design of the address counter is shown below:

[Block diagram: Address Counter with inputs Clk_en, State, Selector and Rst_in, output Mem_Addrs, and a Data_in port]
Figure 4.3 Address counter

The 4 input ports Clk_en, State, Selector and Rst_in determine the value of the output port Mem_Addrs.
4.2.4 Memory Access Controller
The memory access controller arranges the data format for memory access. The 3 input ports Clk_en, State and Mem_Addrs determine which data format is produced at the output port Reg_ctrl.

[Block diagram: Mem_Access_Ctrl with inputs Clk_en, State and Mem_Addrs and output Reg_ctrl]
Figure 4.4 Memory Access Control
4.2.5 Memory Access Register
The input port values of the memory access register come from the outputs of the following modules: clock divider, address counter and memory access control. These 3 input ports determine the output ports Mem_en and Ack_in, as shown below:

[Block diagram: Mem_Access_Reg with inputs Clk_en, Reg_Ctrl and Mem_Addrs and outputs Mem_en and Ack_in]
Figure 4.5 Memory Access Register
4.2.6 Address Decoder
The address decoder is a module that interprets the address of the data requested from memory. This module has 1 input port and 1 output port, as shown in Figure 4.6.

[Block diagram: Addrs_Deco with input Addrs_in and output Deco_out]
Figure 4.6 Address Decoder


4.2.7 Acknowledge Generator
The acknowledge generator is a module that produces the acknowledge signal. The output signals of the memory access register and the accelerator control determine the output of the acknowledge generator module, as shown in Figure 4.7.

[Block diagram: Ackn_gen with inputs Ack_in and State and output Ack_out]
Figure 4.7 Acknowledge Generator
4.2.8 Memory Unit
The memory unit is a module that holds the data pixel and the addresses of the derivative and of the 3 rows of rawpix (1st row, 2nd row and 3rd row). Each item is stored in its own register, and access to the registers is controlled by the address decoder. Figure 4.8 shows the memory unit.

[Block diagram: Memory Unit with ports Clock, Reset, State, Deco_out, Mem_en and Mem_out]
Figure 4.8 Memory Unit

The operation of the 8 sub-design modules above is governed by a controlling unit, which is itself part of those 8 modules. Two modules function as controlling units, namely the accelerator control and the memory access control. However, the accelerator control is the main controlling unit and controls the whole design. The following section discusses the controlling process and its conditions.
4.3 Controlling Unit

The controlling unit is built around a scenario of events. The scenario is designed as a finite state machine (FSM), in which each event is represented by a state. The FSM model of the accelerator control has 11 states, as shown in Table 4.1.

Table 4.1 List of states of the accelerator control

NO.  STATE  REMARK                                             CONDITION
1    A      IDLE                                               RESET=1
2    B      START                                              RUN=1
3    C      READ 1ST ROW RAWPIX IN MEMORY                      LOAD=00
4    D      READ 2ND ROW RAWPIX IN MEMORY                      LOAD=01
5    E      READ 3RD ROW RAWPIX IN MEMORY                      LOAD=10
6    F      ACKNOWLEDGE                                        STATES C, D, E, G, H, I, J AND K HAVE COMPLETED THEIR FULL CYCLE
7    G      COMPUTATION DERIVATIVE FOR 1ST 4 ADJACENT PIXELS   ACK=10
8    H      COMPUTATION DERIVATIVE FOR 2ND 4 ADJACENT PIXELS   STATE G: COL=180
9    I      COMPUTATION DERIVATIVE FOR 3RD 4 ADJACENT PIXELS   STATE H: COL=180
10   J      COMPUTATION DERIVATIVE FOR 4TH 4 ADJACENT PIXELS   STATE I: COL=180
11   K      DERIVATIVE WRITE RESULT TO MEMORY                  ACK=11

The machine begins in the idle state, which is entered on reset. The start state is activated by setting run to 1. The following states are read 1st row rawpix, read 2nd row rawpix and read 3rd row rawpix, which are activated by load=00, load=01 and load=10, respectively. These three states represent the 1st, 2nd and 3rd rows of image pixel data, which are read in parallel from memory. The acknowledge state is activated when each of these states has reached 720 data pixels. A counter counts up to 720 data pixels for each of the 1st row rawpix, 2nd row rawpix and 3rd row rawpix states. The next state is the 1st group computation, which is activated by ack=10. This state completes the computation for 180 pixels in order to activate the 2nd group computation. In the same way, the 2nd group computation completes 180 pixels to activate the 3rd group computation, the 3rd group computation completes 180 pixels to activate the 4th group computation, and the 4th group computation completes 180 pixels before the machine moves to the acknowledge state. These 4 computation states represent the derivative computation for each of the 4 adjacent pixels within one row of image pixel data that has been read through the three read states. Like the read states, these 4 states use a counter, which counts up to 180 pixels. The write derivative state is activated by ack=11. This state completes the writing of 480 rows before returning to the acknowledge state, and it also uses a counter, which counts up to 480 rows. Finally, the acknowledge state moves to the idle state on the condition ack=01. The complete design of the accelerator control is shown in Figure 4.9.

[State diagram: states A to K, with labels Reset, Run=1, Load=00, Load=01, Load=10, Ack=10, Ack=11 and Ack=01 and loop conditions PX<720, PY<720, PZ<720, Col<180 and Row<480]
Figure 4.8 Finite State Machine of Accelerator Control
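The state sequence described above can be modelled with a small behavioural sketch. It is one possible reading of Table 4.1, not the VHDL implementation: it assumes that the states are visited in table order, that the counters are internal to the machine, and that the ack code is set by the phase that has just finished (10 after the reads, 11 after the computations, 01 after the write). The class and helper names are hypothetical.

```python
ROW_PIXELS = 720       # pixels counted in each read state (C, D, E)
GROUPS_PER_ROW = 180   # count used by each computation state (G, H, I, J)
ROWS = 480             # rows counted in the write state (K)

# Counting states: state -> (counter limit, next state). Assumed table order.
COUNTING = {
    "C": (ROW_PIXELS, "D"), "D": (ROW_PIXELS, "E"), "E": (ROW_PIXELS, "F"),
    "G": (GROUPS_PER_ROW, "H"), "H": (GROUPS_PER_ROW, "I"),
    "I": (GROUPS_PER_ROW, "J"), "J": (GROUPS_PER_ROW, "F"),
    "K": (ROWS, "F"),
}
# Assumption: the ack code is set by the phase that hands over to state F.
ACK_ON_EXIT = {"E": 0b10, "J": 0b11, "K": 0b01}
ACK_TARGET = {0b10: "G", 0b11: "K", 0b01: "A"}


class AcceleratorControl:
    def __init__(self):
        self.state, self.count, self.ack = "A", 0, 0b00

    def step(self, run=0, reset=0):
        """Advance one clock tick and return the new state."""
        if reset:
            self.state, self.count, self.ack = "A", 0, 0b00
        elif self.state == "A":
            if run == 1:
                self.state = "B"
        elif self.state == "B":
            self.state = "C"
        elif self.state == "F":
            self.state = ACK_TARGET[self.ack]      # dispatch on the ack code
        else:
            limit, nxt = COUNTING[self.state]
            self.count += 1
            if self.count == limit:                # counter reached its limit
                self.ack = ACK_ON_EXIT.get(self.state, self.ack)
                self.state, self.count = nxt, 0
        return self.state


def trace_states(fsm):
    """Start the machine and list the states visited until it is idle again."""
    visits = [fsm.state]
    fsm.step(run=1)
    while True:
        if fsm.state != visits[-1]:
            visits.append(fsm.state)
        if fsm.state == "A":
            return visits
        fsm.step()
```

Under these assumptions, `trace_states(AcceleratorControl())` visits A, B, C, D, E, F, G, H, I, J, F, K, F and returns to A.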


4.4 Design Integration

Design integration is the process of integrating all the sub-module designs into a new, more complex design. As mentioned at the beginning of this chapter, the design is implemented on an Altera Stratix III FPGA DSP development board. Design integration therefore uses the computer-aided design (CAD) software tools from Altera. It involves 2 CAD software tools, namely DSP Builder and System on a Programmable Chip (SOPC) Builder. The design integration process is divided into 2 parts. The first part transfers all the design modules of the previous section into a DSP Builder model-based design. The second part integrates the DSP Builder model-based design with the Nios II processor and other peripherals such as memory and input/output (I/O). Figure 4.9 shows the expected integrated architecture design.

[Block diagram: NIOS II, DMA blocks, input buffer (DDR2), H/W accelerator Sobel edge detector, buffer (DDR2), output buffer (SRAM), VGA controller and display]
Figure 4.9 The expected integrated architecture design

When the design modules of the previous section are transferred into a DSP Builder model-based design, the interfaces to the other components in the above architecture must be taken into account. Because the design in the previous section is stand-alone, a system interconnect fabric is required for the integrated architecture, and some modification is needed to accommodate it. As shown in Figure 4.9, the system uses a NIOS II processor; the Avalon bus interface is therefore deployed as the system interconnect fabric because it is the standard NIOS II bus interface. Three Avalon bus components are involved in the integration process, namely Avalon MM Write, Avalon MM Read and Avalon Control Port.
To modify the hardware accelerator, ports must be added for interfacing, because the data traffic flows through these ports. The hardware accelerator datapath controls the flow of data. Three main activities take place inside the hardware accelerator datapath: reading data from memory, derivative computation and writing the derivative result to memory. The read and write activities are performed by direct memory access (DMA). The DMA reads pixel data from memory and puts it into 3 parallel 32-bit registers, each holding 4 pixels of 8 bits. These registers supply the pixels, in parallel and pixel by pixel, to the derivative computation arrays, which are executed as a pipeline. The derivative computation result moves into a 32-bit result register that holds 4 pixels of 8 bits each. Once this register has been loaded with 4 pixels, the DMA writes the pixel data to memory.

When the DMA performs read and write operations in memory, it requires an address generator. The address generator counts from the base address as pixel data is read and derivative results are written. Inputs such as the pixel data, the offset counter enable signal and the enable signals for the read and write operations are required to build the datapath of the address generator. The datapath determines pixel addresses using two base address registers: the pixel read register and the derivative write register. The main memory capacity is 8 Mbytes, organized as 1M x 32 bits, that is, 2^20 locations addressed with 20 bits. The pixel data input and the output of the address selector are therefore implemented with 20 bits. The pixel offset counts from the base address; it starts at 0 and increments by 1 for each group of 4 pixels read from memory. The offset is added to the base address to form the 1st read register address. The 2nd read register address is obtained by adding 720/4 to the 1st read register address. Lastly, the 3rd read register address is obtained by adding 1440/4 to the 1st read register address. The count for the derivative result address starts from an offset value of 720/4 and increases by 1 for each memory write. The address selector selects the appropriate calculated address to produce the memory address.

The address generator datapath has 4 registers in total: the pixel read register, the derivative write register, the offset pixel read register and the offset derivative write register. The pixel read register and the derivative write register hold the base addresses of the pixel image data and the derivative image data, respectively. The offset pixel read register and the offset derivative write register hold the counts of pixel groups read and derivative results written, respectively. All 4 registers are controlled by a signal generated by the hardware accelerator control block. Combinational blocks compute the four address signals belonging to the 1st read register, 2nd read register, 3rd read register and write derivative. These 4 combinational outputs go to the address selector, which is a multiplexer that selects one of the 4 address signals. The details of the datapath are shown in Figure 4.10.

Figure 4.10 Memory Address Generator
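The address arithmetic described above can be sketched as a small behavioural model. The register roles, the 720/4 and 1440/4 row offsets, the initial write offset and the 20-bit width follow the text; the selector encoding (0 to 3), the method names and the example base addresses in the usage note are assumptions made for illustration.

```python
ADDR_BITS = 20
ADDR_MASK = (1 << ADDR_BITS) - 1   # addresses are 20 bits wide
WORDS_PER_ROW = 720 // 4           # one image row occupies 180 words


class AddressGenerator:
    """Behavioural model of the memory address generator datapath."""

    def __init__(self, pixel_base, deriv_base):
        self.pixel_base = pixel_base        # pixel read register
        self.deriv_base = deriv_base        # derivative write register
        self.read_offset = 0                # offset pixel read register
        self.write_offset = WORDS_PER_ROW   # offset derivative write register

    def addresses(self):
        """Return the four candidate addresses (combinational adders)."""
        row1 = self.pixel_base + self.read_offset
        row2 = row1 + WORDS_PER_ROW          # 1st address + 720/4
        row3 = row1 + 2 * WORDS_PER_ROW      # 1st address + 1440/4
        deriv = self.deriv_base + self.write_offset
        return tuple(a & ADDR_MASK for a in (row1, row2, row3, deriv))

    def select(self, sel):
        """Address selector (multiplexer); sel encoding 0..3 is assumed."""
        return self.addresses()[sel]

    def advance_read(self):
        """Count one group of 4 pixels read from each of the three rows."""
        self.read_offset += 1

    def advance_write(self):
        """Count one derivative word written to memory."""
        self.write_offset += 1
```

For example, with hypothetical base addresses 0x00000 and 0x40000, the first set of addresses is 0, 180, 360 and 0x40000 + 180, and after one `advance_read()` the three read addresses become 1, 181 and 361.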

The remaining integration aspect of the hardware accelerator system is the Avalon bus interfacing signals. All of these signals correspond to inputs and outputs of the hardware accelerator. The signals are assigned when the integration with SOPC Builder is performed.
The hardware accelerator parts described previously were implemented as components of an embedded system controlled by a NIOS II processor. The NIOS II program is used for data streaming control. A complete embedded system also requires other peripheral components, such as on-chip memory for the Avalon memory-mapped (MM) slave, an interval timer, a DMA controller, parallel input/output (PIO), a JTAG UART and a system ID peripheral. These peripheral components are intellectual property (IP) cores provided by Altera, which are added during SOPC integration as shown in Figure 4.11.

Figure 4.10 Memory Address Generator


The System on a Programmable Chip (SOPC) Builder design software is a tool within the Quartus II environment. When the process in SOPC Builder is completed, the result appears in the Quartus II project window. In the Quartus II project environment, the process continues with timing analysis. An FPGA-based design requires timing analysis because routing within the FPGA device varies depending on where the Quartus II Fitter places the logic. Entering timing constraints gives the Quartus II Fitter design goals, so that it can make intelligent choices about where to place the logic and other elements of the design, and gives the Quartus II TimeQuest Timing Analyzer the information it needs to report whether the timing goals are met. The constraints are coded in an industry-standard format called SDC (Synopsys Design Constraints). SOPC Builder automatically generates SDC files for components that provide timing information. At this point the design is ready for compilation. The Quartus II software takes a few minutes to compile the design. There should be no errors in the compilation process, and the successful completion dialog should appear when it is finished. The output of the compilation is a SOF file, which is downloaded into the FPGA via a USB Blaster cable.

4.5 Software Programming

The design has now been downloaded into the FPGA. For the design to function, it requires software to control the hardware. The software is programmed on the NIOS II processor using its standard instruction code, according to the specifications of the NIOS II Hardware Abstraction Layer (HAL). The software is originally programmed in ANSI C and has two main purposes:
(1) It controls hardware operations, such as DMA transfers between hardware units. It also offers a programming interface for handling data channels, with API commands such as open, read, write and close.
(2) It allows the system to perform simple software operations on the input data instead of using dedicated hardware stages for such processing. For example, NIOS instruction code can be used to convert image arrays into appropriate one-dimensional data streams.
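The array-to-stream conversion mentioned in item (2) can be illustrated with a short sketch. It is written in Python for brevity rather than in the ANSI C used on the NIOS II, and it assumes a row-major order with 8-bit pixels; the function names are hypothetical.

```python
def image_to_stream(image):
    """Flatten a 2-D list of 8-bit pixels into a row-major byte stream."""
    width = len(image[0])
    if any(len(row) != width for row in image):
        raise ValueError("all rows must have the same width")
    # bytes() raises ValueError if a pixel does not fit in 8 bits.
    return bytes(pixel for row in image for pixel in row)


def stream_to_image(stream, width):
    """Rebuild the 2-D pixel array from a row-major byte stream."""
    if len(stream) % width:
        raise ValueError("stream length must be a multiple of the width")
    return [list(stream[i:i + width]) for i in range(0, len(stream), width)]
```

A 2x2 image `[[1, 2], [3, 4]]` becomes the four-byte stream 1, 2, 3, 4, and `stream_to_image` with width 2 restores the original array.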
The NIOS instruction code is downloaded to on-chip memory. The code for this particular design opens the USB port and waits to read the transmitted pixel array. It then controls pixel streaming from input to final output as follows. An image stream is transferred from the host computer to the hardware board for processing through the high-speed USB 2.0 communication channel. The host application communicates with the USB port using Application Programming Interface (API) calls for data input and output. Under the control of NIOS processor instructions, embedded DMA hardware operations transfer data from memory to the NIOS datapath and into the hardware task logic by means of the Avalon interface. The data stream is processed by the hardware accelerator. DMA operations then stream the filtered output from the hardware task logic back to the DDR2 buffer memory, as shown in Figure 4.9.
The final result is output to the on-board VGA digital-to-analog channel, which is a peripheral of the NIOS II processor and is supported by embedded DMA hardware transfers. However, a digital-to-analog converter for a VGA port is not always available on a development board, so a possible alternative is to send the resulting binary image back to the host computer via the USB connection for display.

4.6 Summary

This chapter describes the implementation of the embedded edge detection hardware accelerator. The sub-module designs are presented in detail and discussed with block diagrams to explain how each sub-module supports the complete design. The controlling unit is also discussed, together with its state transitions, which are represented as a finite state machine (FSM). The modifications to the hardware accelerator design that are needed during the integration stage are presented; they accommodate the Altera Avalon bus system interconnect fabric together with the other peripheral components. Software programming is presented as the last step of the implementation and is used to check that the design demonstrates the required functionality.
