Introduction
Datapath Architecture
The datapath design shows the interaction between the accelerator block and the memory block. These 2 blocks are assisted by other blocks that help reduce the required memory bandwidth. The datapath has 5 inputs and 2 outputs and is composed of 8 sub-design modules.

The first module is the clock divider, which converts the 110 MHz clock input into a 54 MHz clock with an 18.5 ns pulse width. This clock is needed to operate the accelerator control block and the control address block. The second block is the accelerator control. This block performs the Sobel edge detection computation and also acts as the controller for the flow of image data to and from the memory unit. The output of the accelerator control block drives the inputs of the following blocks: address counter, memory access controller, and acknowledge generator. The next block is the address counter, which assigns the address locations for the original image data and the derivative image data. Each specific address is obtained by incrementing the base address of the respective image data pixel. The memory access controller controls the change of data format when memory is accessed, from parallel to serial or vice versa. It also directs the memory access register block to receive data from memory or to transmit data to memory. The memory access register likewise supports memory access: data is held temporarily in this block before being sent to memory or after being received from it. Once a packet of data has been completely received or sent, this block triggers the acknowledge generator. The memory unit is where the data pixel, derivative address, 1st row rawpix address, 2nd row rawpix address, and 3rd row rawpix address are stored. The address decoder block determines the address to be accessed in the memory unit. The last block is the acknowledge generator, which produces an acknowledge signal to indicate that the data has been completely read or written. The 8 sub-design modules are described in detail in the following subsections.
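As a software point of reference for the Sobel computation that the accelerator control block performs, the operator can be sketched as follows. This is a minimal sketch, not the hardware implementation: it assumes the standard 3x3 Sobel kernels and the common |Gx| + |Gy| magnitude approximation clamped to 8 bits, neither of which is stated in this chapter, and the function name `sobel_row` is hypothetical. It takes three consecutive rows of raw pixels, mirroring the three rawpix rows held in the memory unit, and returns the derivative values for the interior pixels of the middle row.

```python
def sobel_row(row1, row2, row3):
    """Return Sobel magnitudes for the interior pixels of row2.

    Assumption (not from the source): standard 3x3 Sobel kernels and the
    |Gx| + |Gy| approximation, clamped to the 8-bit range 0..255.
    """
    if not (len(row1) == len(row2) == len(row3)):
        raise ValueError("the three rows must have the same length")
    out = []
    for x in range(1, len(row2) - 1):
        # Horizontal derivative: right column minus left column.
        gx = (row1[x + 1] + 2 * row2[x + 1] + row3[x + 1]) \
            - (row1[x - 1] + 2 * row2[x - 1] + row3[x - 1])
        # Vertical derivative: bottom row minus top row.
        gy = (row3[x - 1] + 2 * row3[x] + row3[x + 1]) \
            - (row1[x - 1] + 2 * row1[x] + row1[x + 1])
        out.append(min(255, abs(gx) + abs(gy)))
    return out
```

A uniform region gives zero response, while a step between the first and third rows gives a non-zero vertical derivative; a model of this kind can serve as a check on the hardware output.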
4.2.1 Clock Divider
The design of the clock divider module is illustrated in figure 4.1.
[Figure 4.1 Clock divider: block Clock_Div with input ports Clk_in and Rst_in and output port Clk_en]
As shown in the block diagram above, the input ports are Clk_in and Rst_in and the output port is Clk_en. Clk_in is driven by a 110 MHz clock source with a 50% duty cycle. The expected output clock rate is 54 MHz, obtained by dividing the input clock source by a factor of 2. The duty cycle of the 54 MHz output is also 50%.
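The divide-by-2 behaviour described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the class name `ClockDivider` is hypothetical, and the reset is assumed to be active high and synchronous, which the text does not specify. The output toggles on every rising edge of Clk_in, so one output period spans two input periods and the duty cycle is 50%.

```python
class ClockDivider:
    """Behavioural model of a divide-by-2 clock divider."""

    def __init__(self):
        self.clk_en = 0

    def rising_edge(self, rst_in=0):
        """Advance by one rising edge of Clk_in and return Clk_en."""
        if rst_in:
            self.clk_en = 0   # assumed: reset forces the output low
        else:
            self.clk_en ^= 1  # toggle on every input rising edge
        return self.clk_en


div = ClockDivider()
# Sample Clk_en after each of 8 consecutive rising edges of Clk_in.
trace = [div.rising_edge() for _ in range(8)]
```

The sampled output alternates between high and low, so it is high for exactly half of the input edges.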
[Block diagram: Accelerator Control, with ports Clk_en, Start, Rst_in and State]
[Figure 4.3 Address counter: block Address Counter with input ports Clk_en, State, Selector and Rst_in and output port Mem_Addrs; the diagram also carries a Data_in label]
The 4 input ports (Clk_en, State, Selector and Rst_in) determine the value of the output port Mem_Addrs.
4.2.4 Memory Access Controller
The memory access controller arranges the data format for memory accesses. The 3 input ports (Clk_en, State and Mem_Addrs) determine which data format is produced at the output port Reg_ctrl.
[Figure 4.4 Memory Access Control: block Mem_Access_Ctrl with input ports Clk_en, State and Mem_Addrs and output port Reg_ctrl]
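The format conversion handled by the memory access controller can be illustrated with the 32-bit word layout used later in this chapter, where one 32-bit register holds 4 pixels of 8 bits each. The following is a minimal sketch: the function names are hypothetical, and the byte order (pixel 0 in the least significant byte) is an assumption that the text does not specify.

```python
def unpack_pixels(word):
    """Split a 32-bit word into four 8-bit pixels (parallel to serial)."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("word must fit in 32 bits")
    # Assumed byte order: pixel 0 sits in the least significant byte.
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def pack_pixels(pixels):
    """Combine four 8-bit pixels into one 32-bit word (serial to parallel)."""
    if len(pixels) != 4 or any(not 0 <= p <= 0xFF for p in pixels):
        raise ValueError("expected four 8-bit pixel values")
    word = 0
    for i, p in enumerate(pixels):
        word |= p << (8 * i)
    return word
```

Packing and unpacking are inverses of each other, so a word read from memory can be split into pixels and reassembled without loss.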
4.2.5 Memory Access Register
The input port values of the memory access register come from the outputs of the clock divider, address counter and memory access control modules. These 3 input ports determine the output ports Mem_en and Ack_in, as shown below:
[Figure 4.5 Memory Access Register: block Mem_Access_Reg with input ports Clk_en, Reg_Ctrl and Mem_Addrs and output ports Mem_en and Ack_in]
4.2.6 Address Decoder
The address decoder is a module that interprets the address of the data requested from memory. This module has 1 input port and 1 output port, as shown in figure 4.6.
[Figure 4.6: block Addrs_Deco with input port Addrs_in and output port Deco_out]
[Figure 4.7 Acknowledge Generator: block Ackn_gen with ports Ack_in, State and Ack_out]
4.2.8 Memory Unit
The memory unit is the module that holds the data pixel and the addresses for the derivative and the 3 rows of rawpix (1st row, 2nd row and 3rd row). Each of these items is stored in its own register, and access to the registers is controlled by the address decoder. Figure 4.8 shows the memory unit.
[Figure 4.8 Memory Unit: block Memory Unit with ports Clock, Reset, State, Deco_out, Mem_en and Mem_out]
Controlling Unit
The controlling unit is built around a scenario of events. The scenario is designed as a finite state machine (FSM), in which each event is represented by a state. The FSM model for the accelerator control has 11 states, as shown in table 4.1.
STATE | REMARK | CONDITION
A | IDLE | RESET=1
B | START | RUN=1
C | READ 1ST ROW RAWPIX | LOAD=00
D | READ 2ND ROW RAWPIX | LOAD=01
E | READ 3RD ROW RAWPIX | LOAD=10
F | ACKNOWLEDGE | STATES C, D, E, G, H, I, J AND K HAVE COMPLETED THEIR FULL CYCLE
G | COMPUTATION, ADJACENT PIXELS | ACK=10
H | COMPUTATION, ADJACENT PIXELS | STATE G: COL=180
I | COMPUTATION, ADJACENT PIXELS | STATE H: COL=180
J | COMPUTATION, ADJACENT PIXELS | STATE I: COL=180
K | WRITE DERIVATIVE | ACK=11
The FSM begins in the idle state, which is entered on reset. The start state is activated by setting run to 1. The following states are read 1st row rawpix, read 2nd row rawpix and read 3rd row rawpix. These states are activated by load=00, load=01 and load=10, respectively. These three states represent the 1st, 2nd and 3rd rows of image pixel data, which are read in parallel from memory. The acknowledge state is activated when each of these states has reached 720 data pixels; a counter counts up to 720 data pixels in each of the three read states. The next state is 1st group computation, which is activated by ack=10. This state completes the computation for 180 pixels and then activates 2nd group computation. In the same way, 2nd group computation completes the computation for 180 pixels and then activates 3rd group computation. The 3rd group computation state also completes the computation for 180 pixels before moving to the acknowledge state. These 4 computation states represent the derivative computation for each of 4 adjacent pixels within one row of image pixel data that has been read through the three read states. Like the read states, these 4 states use a counter, which counts up to 180 pixels. The write derivative state is activated by ack=11. This state completes writing for 480 rows and then returns to the acknowledge state; a counter counts up to 480 rows for this state. Finally, the acknowledge state moves to the idle state on the condition ack=01. The complete design of the accelerator control is shown in figure 4.9.
[Figure: accelerator control state diagram, with states labelled A to K and transition labels Reset, Run=1, Load=00, Load=01, Load=10, PX<720, PY<720, PZ<720, Ack=10, Col<180, Ack=11, Row<480 and Ack=01]
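The state sequence described in this section can be sketched as a small table-driven model. This is an illustrative sketch only, built on several assumptions that are not confirmed by the source: the class name `AcceleratorFSM` and the event strings are hypothetical; the read states are assumed to pass through the acknowledge state F between rows; the four computation states G, H, I and J are assumed to run back to back before returning to F; and each counting state is assumed to advance its counter by one per step.

```python
# Transitions taken on a control condition (event), per the state table.
TRANSITIONS = {
    ("A", "run=1"): "B",
    ("B", "load=00"): "C",
    ("F", "load=01"): "D",   # assumed: next row is loaded from acknowledge
    ("F", "load=10"): "E",
    ("F", "ack=10"): "G",
    ("F", "ack=11"): "K",
    ("F", "ack=01"): "A",
}
# Counter limits from the text: 720 pixels, 180 columns, 480 rows.
COUNT_LIMIT = {"C": 720, "D": 720, "E": 720,
               "G": 180, "H": 180, "I": 180, "J": 180, "K": 480}
# State entered when a counting state reaches its limit (assumed chain).
DONE_TARGET = {"C": "F", "D": "F", "E": "F",
               "G": "H", "H": "I", "I": "J", "J": "F", "K": "F"}


class AcceleratorFSM:
    def __init__(self):
        self.state = "A"
        self.count = 0

    def step(self, event=None):
        """Advance one clock tick and return the new state letter."""
        if event == "reset":
            self.state, self.count = "A", 0
        elif self.state in COUNT_LIMIT:
            self.count += 1
            if self.count == COUNT_LIMIT[self.state]:
                self.state, self.count = DONE_TARGET[self.state], 0
        else:
            self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state
```

Keeping the transitions in dictionaries makes it easy to compare the model against the state table row by row and to adjust it if the actual design orders the states differently.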
Design Integration
Design integration is the process of combining all the sub-module designs into a new, more complex design. As mentioned at the beginning of this chapter, the design is implemented on an Altera Stratix III FPGA DSP development board. Design integration therefore uses the Computer-Aided Design (CAD) software tools from Altera. Two CAD tools are involved: DSP Builder and System on a Programmable Chip (SOPC) Builder. The design integration process is divided into 2 parts. The first part transfers all the design modules of the previous subchapter into a DSP Builder model-based design. The second part integrates the DSP Builder model-based design with the Nios II processor and other peripherals such as memory, input/output (I/O), etc. Figure 4.9 shows the expected integrated architecture design.
[Figure: integrated architecture, showing NIOS II, two DMA blocks, DDR2 input buffer, H/W accelerator (Sobel edge detector), DDR2 buffer, SRAM output buffer, VGA controller and display]
When transferring the design modules of the previous subchapter into a DSP Builder model-based design, the interfaces to the other components in the architecture above must be taken into account. Because the design in the previous subchapter is stand-alone, a system interconnect fabric is required for the integrated architecture, and some modification is needed to accommodate it. As shown in figure 4.9, the system uses a NIOS II processor; the Avalon bus interface is therefore deployed as the system interconnect fabric because it is the standard NIOS II bus interface. Three Avalon bus components are involved in the integration process: Avalon MM Write, Avalon MM Read and Avalon Control Port.
To modify the hardware accelerator, it is necessary to include ports for interfacing, because the data traffic flows through these ports. The hardware accelerator datapath controls the flow of data. There are 3 main activities inside the hardware accelerator datapath: reading data from memory, derivative computation, and writing the derivative result into memory. Read and write activities are performed by direct memory access (DMA). The DMA reads pixel data from memory and puts it into 3 parallel 32-bit registers, each holding 4 pixels of 8 bits. These registers supply pixels in parallel, pixel by pixel, to the derivative computation arrays, which are executed as a pipeline. The derivative computation result moves into a 32-bit result register that holds 4 pixels of 8 bits each. Once this register has been loaded with 4 pixels, the DMA writes the pixel data into memory.

When the DMA performs read and write operations in memory, it requires an address generator. The address generator counts from the base address as pixel data is read and the derivative result is written. Inputs such as the data pixel, the offset counter enable signal, and the enable signals for the read and write operations are required to build the datapath of the address generator. The datapath determines pixel addresses using two base address registers: the pixel read register and the derivative write register. The main memory capacity is 8 Mbytes, organized as 1M x 32 bits, so that 2^20 locations are addressed with 20 bits. The pixel data input and the output of the address selector are implemented with 20 bits. The pixel offset counts from the base address; it starts at 0 and increments by 1 for each group of 4 pixels read from memory. The offset is added to the base address to give the 1st read register address. The 2nd read register address is obtained by adding 720/4 to the 1st read register address. Lastly, the 3rd read register address is obtained by adding 1440/4 to the 1st read register address. The count for the derivative result address starts from an offset value of 720/4 and increases by 1 for each memory write. The address selector selects the appropriate calculated address to produce the memory address.

The address generator datapath has 4 registers in total: the pixel read register, derivative write register, offset pixel read register and offset derivative write register. The pixel read register and derivative write register hold the base addresses of the pixel image data and the derivative image data, respectively. The offset pixel read register and the offset derivative write register hold the counts of pixel groups read and derivative results written, respectively. All 4 registers are controlled by a signal generated by the hardware accelerator control design block. Combinational blocks add the four address signals belonging to the 1st read register, 2nd read register, 3rd read register and write derivative. These 4 combinational outputs feed the address selector. The address selector is represented
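The address arithmetic described above can be sketched as a small model of the four registers. This is a minimal sketch rather than the actual datapath: the class and method names are hypothetical, the base address values are supplied by the caller, and the address selector is reduced to returning the three read addresses together, so the choice among them is left to the caller.

```python
ROW_WORDS = 720 // 4        # 32-bit words per image row (4 pixels per word)
ADDR_MASK = (1 << 20) - 1   # memory addresses are 20 bits wide


class AddressGenerator:
    def __init__(self, pixel_base, deriv_base):
        self.pixel_base = pixel_base    # pixel read register (base address)
        self.deriv_base = deriv_base    # derivative write register (base)
        self.read_offset = 0            # offset pixel read register
        self.write_offset = ROW_WORDS   # offset derivative write register

    def read_addresses(self):
        """Return the 1st, 2nd and 3rd row read addresses, then advance."""
        first = self.pixel_base + self.read_offset
        # 2nd row is 720/4 words further on, 3rd row is 1440/4 words further.
        addrs = (first, first + ROW_WORDS, first + 2 * ROW_WORDS)
        self.read_offset += 1           # one group of 4 pixels was read
        return tuple(a & ADDR_MASK for a in addrs)

    def write_address(self):
        """Return the next derivative write address, then advance."""
        addr = (self.deriv_base + self.write_offset) & ADDR_MASK
        self.write_offset += 1          # one memory write was performed
        return addr
```

For example, with a pixel base address of 0x100 the first call to `read_addresses()` yields 0x100, 0x100 + 180 and 0x100 + 360, and the second call yields the same three addresses incremented by 1.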
The remaining integration aspect of the hardware accelerator system is the Avalon bus interfacing signals. All of these signals correspond to the inputs and outputs of the hardware accelerator. The signals are assigned when the integration with SOPC Builder is performed.
The hardware accelerator parts described previously were implemented as components of an embedded system controlled by a NIOS II processor. The NIOS II program is used for data streaming control. A complete embedded system also requires other peripheral components, such as on-chip memory for the Avalon memory-mapped (MM) slave, an interval timer, a DMA controller, parallel input/output (PIO), a JTAG UART, a system ID peripheral, etc. These peripheral components are Intellectual Property (IP) cores provided by Altera, which are added during SOPC integration as shown in figure 4.11.
4.5 Software Programming
The design is now downloaded into the FPGA. For the design to function, software is required to control the hardware. The software runs on the NIOS II processor using its standard instruction code, according to the specifications of the NIOS II Hardware Abstraction Layer (HAL). The software is written in ANSI C and has two main purposes:
(1) It controls hardware operations, like DMA transfers between hardware units. It
also offers a programming interface for handling data channels, with API
commands like open, read, write and close.
(2) It allows the system to perform simple software operations on the input data
instead of using dedicated hardware stages for such processing. For example, Nios
instruction code can be used to convert image arrays into appropriate one-dimensional data streams.
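The conversion of an image array into a one-dimensional data stream mentioned in item (2) can be sketched as follows. The sketch is written in Python for brevity, whereas the actual software is in ANSI C; the function name `image_to_stream` is hypothetical, and row-major ordering is an assumption that the text does not state.

```python
def image_to_stream(image):
    """Flatten a 2-D image (a list of rows) into a row-major pixel stream."""
    if not image:
        return []
    width = len(image[0])
    stream = []
    for row in image:
        if len(row) != width:
            raise ValueError("all rows must have the same width")
        stream.extend(row)  # append the row's pixels in order
    return stream
```

With row-major ordering, the pixel at row r and column c of an image of width w ends up at position r * w + c in the stream.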
NIOS instruction code is downloaded to on-chip memory. The code for the particular
design opens the USB port and waits to read the transmitted pixel array. It then controls
pixel streaming from input to final output as described in the following process: An
image stream is transferred from the host computer to the hardware board for processing
through the high-speed USB 2.0 communication channel. The host application
communicates with the USB port using Application Programming Interface (API) calls for
data input and output. According to NIOS processor instructions, embedded DMA
hardware operations transfer data from memory to the NIOS datapath and into the
hardware task logic by means of the Avalon interface. The data stream is processed
through the hardware accelerator. DMA operations stream the filtered output result from
the hardware task-logic back to buffer DDR2 memory as shown in figure 4.9.
The final result is output to the on-board VGA digital-to-analog channel, which is a peripheral of the NIOS II processor and is supported by embedded DMA hardware transfers. However, a digital-to-analog converter for a VGA port is not always implemented on a development board, so a possible alternative is to channel the resulting binary image back to the host computer via the USB connection for display.
4.6 Summary