Sie sind auf Seite 1von 65

Universitat Bielefeld Technische Fakultat AG Technische Informatik

Diplomarbeit

A parallel implementation of path planning on recongurable hardware

Marcus Grieger November 11, 2004

Betreuer Prof. Dr. Ralf Moller Dipl.-Inform. Tim Kohler

Diese Diplomarbeit wurde von Prof. Dr. Ralf M ller und Dipl.-Inform. Tim K hler o o betreut. Gutachter waren Prof. Dr. Ralf M ller und Dipl.-Inform. Tim K hler. o o

Universitat Bielefeld, AG Technische Informatik

Contents

Contents
1 Introduction 1.1 Dijkstras shortest path algorithm . . . . . . . . . . . . . . 1.2 Path planning on binary maps . . . . . . . . . . . . . . . . 1.3 Mapping of binary map nodes on recongurable hardware 1.4 Applications of grid based shortest path algorithms . . . . 1.5 Related projects . . . . . . . . . . . . . . . . . . . . . . . 5 5 6 6 6 7 9 9 9 9 12 13 17 17 18 19 21 21 23 25 29 32 34 37 37 40
3

. . . . .

. . . . .

. . . . .

. . . . .

2 The Dijkstra Algorithm 2.1 The Dijkstra Shortest Path algorithm . . . . . . . . . . . . . . . . 2.1.1 Initialisation . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.2 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Simplication of the Dijkstra algorithm for a mapping of graph vertices on digital hardware in grid-based applications . . . . . . . . 2.2.1 Tracing of the shortest path in the potential vector . . . . 3 Field Programmable Gate Arrays (FPGAs) 3.1 History of FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Internal structure of FPGAs . . . . . . . . . . . . . . . . . . . . . 3.3 Suitability of FPGAs for the specied application . . . . . . . . . 4 Implementation of a Prototype 4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Design of graph vertices in digital hardware . . . . . . . . 4.3 Generation of the node matrix by a Verilog code generator 4.4 Interface to a host system . . . . . . . . . . . . . . . . . . 4.5 Software environment on the host system . . . . . . . . . 4.6 Result of a test run . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5 Analysation and Discussion 5.1 Substitution of DFFs by LUT based delay elements . . . . . . . . 5.2 Low-level instantiation of logic elements by VQM source les . .
Universitat Bielefeld, AG Technische Informatik

Contents

5.2.1 Structure of VQM les . . . . . . . . . . . . . . . . . . . . 5.2.2 Implementation of the LUT delay element in VQM . . . . 5.3 Reduction of the number of needed DFFs by using an alternate path tracing algorithm . . . . . . . . . . . . . . . . . . . . . . . . 5.4 Implementation of the address logic and host interface . . . . . . 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A Large format schematics A.1 Schematic of the top level entity . . . . . . . . . . . . . . . . . . . A.2 Schematic of a graph vertex implemented in digital hardware . . B Source code B.1 main.m - Main module of the Verilog code generator . . . . . . . B.2 serial.c - Support functions for the host utilities . . . . . . . . . . B.3 map transfer.c - Host utility to transfer a map to the FPGA . . . . C Bibliography

40 45 46 48 50 53 53 54 55 55 58 61 63

Universitat Bielefeld, AG Technische Informatik

1
Introduction
One of the problems in mobile robotics is navigation and path planning on maps. An approach to nding a shortest path is to encode the map in a weighted graph and use Dijkstras shortest path algorithm [Min78] to nd the shortest path between two nodes in this graph. The downside of this algorithm is a quadratic growth of complexity dependent on the number of nodes. This work describes a parallel implementation of a simplied version of Dijkstras algorithm to achieve a linear growth.

1.1

Dijkstras shortest path algorithm

The common version of Dijkstras algorithm solves an all-shortest-path problem 1 in a weighted graph. Each vertex represents a location on a map and is connected to one or more other vertices via weighted edges, the weight representing a distance or cost incurred by following this edge. After dening a destination or reference vertex, the algorithm iteratively calculates a distance potential in every vertex which represents the distance to the destination node. The shortest path from any starting vertex is found by subsequently following the edge to the neighbouring vertex with the lowest potential. A detailed formal description of this algorithm is found in the rst part of chapter 2.
A solution to an all-shortest-path problem provides the shortest path from a reference vertex to any other vertex of the graph, not only the shortest path between two specic vertices.
1

Universitat Bielefeld, AG Technische Informatik

1 Introduction

1.2

Path planning on binary maps

A possible representation of a map generated by a mobile robots sensors is a simple two dimensional binary map, where a logical zero represents free space and a logical one represents obstacles. This map is converted to a graph by placing a vertex in every pixel of the map and connecting it with eight edges to its eight neighbours. The weights result from the simple Euclidian distance of 1 to orthogonal neighbours and 2 to diagonal neighbours. Obstacle vertices get an innite weight or are not connected at all.

1.3

Mapping of binary map nodes on recongurable hardware

In a classic implementation, each iteration requires a sequential calculation of each vertex potential to be updated, therefore a parallel calculation of all map nodes on suitable hardware should result in a signicant reduction of complexity and thus in great acceleration of this algorithm. To achieve this goal, more simplications are made to allow for a creation of an electronic equivalent of a graph vertex. An interconnected matrix of these vertices can be implemented on a eld programmable gate array (FPGA) and the calculation of one iteration is possible in a single clock cycle.

1.4

Applications of grid based shortest path algorithms

In addition to the acceleration of ofine and real time path planning in mobile robots, variants of this algorithm could be used in other applications: A quite similar problem of nding a shortest path can be found in so-called maze routing, where conductive tracks on a printed circuit board have to be routed between pins of components without overlapping or crossing other tracks. This variant starts with a clear map and every routed track becomes an obstacle for the following tracks. This method of routing normally uses only orthogonally connected vertices, but in the case of a multilayer board, the map may become three dimensional. Another application could be segmentation in the eld of image processing. A comparison of region growing algorithms to the wave propagation in this algorithm suggests using a variant that uses more than one reference vertex as initial region seeds and some binary tag being propagated with each wave to distinguish between different regions later.

Universitat Bielefeld, AG Technische Informatik

1 Introduction

1.5

Related projects

[Nes03] describes an implementation of a maze-routing accelerator, based on Xilinx XC4000 and Xilinx VirtexII FPGAs. An implementation of an accelerator for the calculation of a shortest path in a common graph is described in [TS01]. This approach aims at the acceleration of the Dijkstra algorithm used in dynamic network routing.

Universitat Bielefeld, AG Technische Informatik

1 Introduction

Universitat Bielefeld, AG Technische Informatik

2
The Dijkstra Algorithm
2.1 The Dijkstra Shortest Path algorithm
The Dijkstra Shortest Path Algorithm is described in [Min78]. To solve an all-shortestpath problem, each location is assigned to a vertex v V in an undirected graph G = (V, E). All locations reachable from this vertex are then connected by a set of weighted edges E. A weight represents a distance or cost incurred by following the path along this edge. The algorithm involves an initialisation step and a number of iterations:

2.1.1

Initialisation

A reference or target vertex t V is dened and a distance potential p(x) is assigned to every x V . This potential is initialized with p(x) = 0 for x = t and p(x) = for all other vertices: p(t) = 0 p(x) = (2.1) (2.2)

x = t, x V.

Initially, all vertices x = t and all edges are uncoloured. Vertex t is coloured and becomes the most recently coloured vertex named i.

2.1.2

Iteration

For each uncoloured vertex x that receives an edge from i, p(x) is updated as the minimum of its own potential and the potential of vertex i in addition to the cost of the connecting edge:
Universitat Bielefeld, AG Technische Informatik 9

2 The Dijkstra Algorithm

p(x) = min(p(x), p(i) + w(i, x))

(2.3)

If uncoloured vertices with a nite potential exist, the vertex j with the lowest potential is coloured and i is set to j. Additionally, the edge that determined the value of p(x) is coloured. Else, no further vertices are reachable and the algorithm stops. If the vertex s representing the starting location has been coloured, the shortest path from this location to the reference node t has been found and can be obtained by following the coloured edges.

10

Universitat Bielefeld, AG Technische Informatik

2 The Dijkstra Algorithm

Figure 2.1: Example of a shortest path calculation in a weighted graph from vertex s to t using the Dijkstra method. The reference vertex t is initialized with p(t) = 0, all other vertices are initialized with p(x) = , x = t, denoted by a blank node. In f), the starting node s is coloured and the shortest path from s to t is represented by bold (coloured) edges.

Universitat Bielefeld, AG Technische Informatik

11

2 The Dijkstra Algorithm

2.2

Simplication of the Dijkstra algorithm for a mapping of graph vertices on digital hardware in grid-based applications

To solve a path planning problem on a binary map, the vertices of the graph G = (V, E) are placed in a regular grid and each vertex v V is connected with its eight neighbours via edges with constant weights w(y, x) = 1 for orthogonal edges and w(y, x) = 2 for diagonal edges. For this special case, [M l99] suggests a simplication of the Dijkstra algorithm that o allows an implementation of a graph vertex in simple digital hardware. The rst simplication suggested by the accelerated Dijkstra method as described in [M l99] is to use integral weights. Incorporating a factor of 2, weights are w(y, x) = 2 o for orthogonal edges and w(y, x) = 3 for diagonal edges. This eliminates the need to implement complex combinatorial circuits for real number arithmetics and allows an encoding of integral weights through an equivalent number of delay elements in hardware. In the original Dijkstra algorithm, the vertex with the lowest potential is coloured in each iteration. To achieve the same goal, the implementation proposed by [M l99] o introduces a global counter that is incremented in constant periods of time. At the start of a calculation, a binary signal is propagated from the reference node t to the neighbouring nodes. This signal then spreads through the entire graph, delayed by a number of clock cycles equivalent to the weight of the edges. When the signal reaches a node, the value of the global counter is latched in a register inside the node. The time needed by the signal to reach a node is equal to the number of delay elements traversed and thus equal to the sum of the weights of the edges involved. Since the global counter increments equally to the time needed, the latched value represents the same sum of the weights. All incoming signals of the neighbouring edges of a node are OR combined, so the earliest signal locks this node. Therefore, the OR combination is the corresponding equivalent to the minimum function from the original algorithm. Due to the integral weights and the regularity of the grid, multiple nodes may obtain the same minimal potential at a time. Because the distance between two neighbouring nodes is always shorter than any other path between these nodes, all nodes with the same potential can be coloured at once. This is the main reason why a parallel implementation of this algorithm is possible. By using this method, the maximum number of iterations that are needed to calculate all potentials of reachable nodes in a given n by m grid can be approximated: In
12 Universitat Bielefeld, AG Technische Informatik

2 The Dijkstra Algorithm

a worst case scenario, every other row of the map has to be traversed almost completely. The rows in between are completely marked as obstacle nodes, except for a one node wide passage alternating at the left and right border of the map. So the only path through this map is serpentine shaped (see example in gure 2.2). If the diagonal edges at the turns are neglected, an upper approximation of the length of the longest path is (m 1) (n 1) 2 and the maximum potential pmax is twice that number: pmax = (m 1) (n 1) (2.5)

(2.4)

Figure 2.2: Serpentine shaped map for a worst-case scenario

After the number of clock cycles exceeds pmax , all reachable nodes still have a potential value lower than the nal value cmax of the global counter. Thus, these can be considered coloured and all nodes with a potential equal to cmax are unreachable. This elimiates the need to encode the colouration of nodes.

2.2.1

Tracing of the shortest path in the potential vector

The colouration of edges can also be omitted if an iterative tracing algorithm on the resulting potential vector is used.

Universitat Bielefeld, AG Technische Informatik

13

2 The Dijkstra Algorithm

Initialization v0 = s sV (2.6)

v denotes the resulting path, v0 = s is the starting location. Iteration In each iteration step, the following vertex vn+1 is set as the neighbouring vertex with the minimal potential. When the reference vertex t is reached at T , the algorithm stops and v0 , . . . , vT describes the shortest path from s to t. vn+1 = arg min (p(x)) (2.7)

xconnected tovn

Figure 2.3 shows an example of the resulting algorithm at different values of the global counter.

14

Universitat Bielefeld, AG Technische Informatik

2 The Dijkstra Algorithm

Figure 2.3: Calculation of node potentials in the simplied algorithm. The values for the global counter are a) 0, b) 2, c) 4, d) 5, e) 6, f) 7. Red edges have a weight of 2, blue edges have a weight of 3, and green edges are unconnected.

Universitat Bielefeld, AG Technische Informatik

15

2 The Dijkstra Algorithm

16

Universitat Bielefeld, AG Technische Informatik

3
Field Programmable Gate Arrays (FPGAs)
3.1 History of FPGAs
Today, FPGAs are used mainly for rapid prototyping of application specic integrated circuits (ASIC). In the past, the demand for integrated circuits for all kinds of applications steadily increased, but the development of a full-custom chip was an expensive process. Originally, the creation of a chip required the design of custom wafers. Special clean rooms and equipment for photolithography and etching of these wavers were needed for the creation of a prototype, and the whole process was very uneconomical if only a limited-lot production was needed. One of the early ancestors of FPGAs was the Uncommited Logic Array (ULA) or Gate Array (GA). The die of an ULA consists of an array of unconnected standard logic gates and other elements needed for the creation of an application specic integrated circuit (ASIC). These elements were connected later by applying metal interconnections to an additional layer of the die. The possibility to mass produce ULAs for different applications reduced the overall design cost compared to the creation of full-custom chips. The creation of an ASIC prototype using an ULA was still a complex process, since the application of the metal layer could not be done by the designer. In those days, the design process started to change - the designer no longer had to create the wafer mask of the die, but the new task was to connect already existing parts of the chip to implement the desired functionality.
Universitat Bielefeld, AG Technische Informatik 17

3 Field Programmable Gate Arrays (FPGAs)

The desire to have user programmable chips arose and led to the invention of programmable array logic (PAL) and gate array logic (GAL) chips. A PAL consists of a matrix of AND and OR gates and the possibility to create interconnections at specic points in this matrix by using fuse or antifuse technologies similar to those used in PROMs and EPROMs. A GAL is a more complex version, which basically includes several PALs on one die. These can be connected with each other to create more complex logical functions. At this point, the developer no longer needed to send his metal connection mask to the ULA manufacturer, but he could obtain readily available chips and program the needed interconnections using programming hardware on site, which was much cheaper. Another step to speed up the development of projects involving ASICs was the invention of in-circuit-programming methods. The rst user programmable chips had to be removed from and reinserted into a socket for each test of a new version. This not only took time, but uncareful handling and mechanical stress from a repeated removal and insertion could also lead to a destruction of the component. As a result, the Joint Test Action Group (JTAG) developed a standard interface which allowed programming of a device by using a simple serial port protocol without the need for a removal from the circuit board. The growing complexity of ASIC designs eventually led to the invention of the eld-programmable gate array (FPGA) in the early 1980s. An FPGA consists of a lot of logic cells arranged in a two dimensional array, which can be interconnected in almost any way1 . With computers becoming faster, the design process evolved simultaneously. More and more tasks of the design process were implemented in software tools to aid the designer. This led to todays complex development environments that allow a prototype to be created and simulated on almost any personal computer. Two of the main FPGA manufacturers nowadays are Altera and Xilinx.

3.2

Internal structure of FPGAs

FPGAs consist of basic elements called logic blocks. These are arranged in a two dimensional array and can be connected by a programmable connection matrix. To connect internal logic to external circuits, I/O blocks at the border of the array can also be connected to the logic blocks. Newer devices may also contain a number of additional features like clock generators, larger blocks of on-chip memory, digital signal processor (DSP) components and even complete CPU cores.
1

only limited by connection ressources, but there are plenty for most practical applications

18

Universitat Bielefeld, AG Technische Informatik

3 Field Programmable Gate Arrays (FPGAs)

The logic blocks consist mainly of one or more lookup tables (LUTs), in most cases 4-input LUTs, to implement logical gates. Most FPGAs implement the LUTs as well as the conguration registers for the connection matrix in SRAM technology to achieve high overall chip clock rates. This requires a reconguration of the device after each power-up. Some alternative implementations use non-volatile concepts like EEPROM or Flash technology. Those are not as fast as an SRAM based device, but no external components for the conguration of the logic device is needed after an initial programming process. In addition to the LUTs, ip-ops are provided to implement storage registers in a logic element. Depending on the manufacturer and the particular device, these logic blocks are called logic array blocks (LABs, Altera) or congurable logic blocks (CLBs, Xilinx). Most newer devices partition these basic blocks into smaller elements. While a Xilinx XC4000 2 CLB contains two LUTs and two registers only, in the Virtex II3 FPGA, one CLB is divided into four slices which contain two LUTs and two registers each. An Altera Stratix 4 LAB includes ten logic elements (LEs), which provide one LUT and one register each. The more advanced Stratix II 5 provides eight adaptive logic modules (ALMs) per LAB. An ALM contains a comparable larger LUT that can be used in different congurations regarding the number of input signals. It also includes two registers. The programmable interconnections differ dependent on manufacturer and device, too. All devices provide a varying number of horizontal and vertical signal lines of variable length across the chip, though. Additionally, most devices support global clock signals to distribute a low skew clock to every part of the chip. One main difference between Altera and Xilinx devices is that Altera uses a column oriented approach while Xilinx maintains a true grid structure. In an Altera FPGA, the LABs are placed in columns thus providing very fast interconnections to the neighbouring LABs in a row.

3.3

Suitability of FPGAs for the specied application

The simplied algorithm described in chapter 2.2 can be realised with a very small number of gates and registers per node. This fact and the arrangement of the nodes
2 3

XC4000 data sheet: [Cor99] Virtex II data sheet: [Cor03b] 4 Stratix data sheet: [Cor04b] 5 Stratix II data sheet: [Cor04c]

Universitat Bielefeld, AG Technische Informatik

19

3 Field Programmable Gate Arrays (FPGAs)

in a grid suggest the use of a similar structured silicon device. Besides the expensive creation of an ASIC or a full-custom chip, the implementation on an FPGA seems to be the only reasonable option. An efcient parallel implementation of the algorithm can be achieved if it is possible to map one graph vertex to a single logic block of an FPGA. The selection of an appropriate FPGA for this algorithm is difcult, because these days, a broad range of devices optimised for different applications are commercially available. There are low-budget as well as high performance solutions with a lot of different additional features from both Xilinx and Altera, and there are other devices from other companies like Actel, Quicklogic, and Lattice, too. For this project, the Altera Stratix device family was chosen in favor of a Xilinx product for one main reason: Compared to XC4000 and Virtex II from Xilinx, they offer more ip-ops per basic logic block. Since the maximum counter value depends on the size of the map, a larger number of registers to store that value is desirable to maximize the possible map size.

20

Universitat Bielefeld, AG Technische Informatik

4
Implementation of a Prototype
A running implementation of the algorithm described in 2.2 was created using an Altera Nios Statix Edition prototyping board[Cor03a] on an Altera Stratix 1S10 FPGA running at a clock rate of 50 MHz. This version allows the calculation of potentials in a map as big as 16 by 16 pixels. A graph node is implemented by using standard elements from the Quartus II design software, possible optimisations and problems associated with this implementation are discussed in the next chapter.

4.1

Overview

The prototype consists of the FPGA development board which is connected to a host system by a RS232 serial connection. On the host, a set of utilities described later is used to transfer map data to the FPGA, execute the calculation and retrieve the resulting potential vector.

Figure 4.1: System overview of the prototype

Universitat Bielefeld, AG Technische Informatik

21

4 Implementation of a Prototype

Figure 4.2: Top level entity of the FPGA design. A larger version of this gure can be found in appendix A.1

The FPGA top-level design (Figure 4.2) uses the following components: node matrix The node matrix component contains a matrix of 16 by 16 graph nodes for the calculation of the potentials. It is created automatically by an Objective C program described in 4.3 that generates a Verilog source le which contains the instantiation and interconnection source code for single graph nodes. command register, decoder, msr These components implement a state machine to execute instructions received via the universal asynchronous receiver/transmitter (UART). The output of the command register is wired to eight on-board LEDs to display the contents of this register for debugging purposes. address register, address decoders The address register stores the address of a node selected by the ADDRESS instruction. The binary address is then decoded to an array of unary selector lines by the row decoder and col decoder. Also, the node address is decoded by two
22 Universitat Bielefeld, AG Technische Informatik

4 Implementation of a Prototype

seven-segment-decoders to display the current address on the seven segment elements on the Nios board. UARTS The UART core was obtained in VHDL source from Bertrand Cuzeau of ALSE[ALS03] with permission to use in this project. It is connected by two data busses for incoming and outgoing data and a few signal lines indicating the receive and transmit states. A variant of this design supporting larger addresses was made to test an implementation with maps larger than 16 by 16 nodes. This included a second status register and an extension of the instruction decoder to transfer these addresses. An implementation which could handle a map sized 24 by 24 nodes could be compiled successfully for the Stratix 1S10 FPGA.

4.2

Design of graph vertices in digital hardware

The rst implementation of a graph node is a modication of the design suggested in [M l99] It uses three D-Flip-Flops (DFF) to implement the delay elements and a fourth o DFF to store a conguration bit to mark obstacles. More DFFs are used in the lpm counter megafunction of the Quartus II software which can implement an arbitrary width binary counter. The prototype used a ve bit wide counter, this results in a total of nine DFFs per vertex used in this design.

Figure 4.3: Implementation of a hardware graph vertex. A larger version of this gure can be found in appendix A.2

Universitat Bielefeld, AG Technische Informatik

23

4 Implementation of a Prototype

in0, . . . , in7 are the input signals for incoming edges of the neighbour vertices. in0, . . . , in3 are connected to the orthogonal and in4, . . . , in7 are connected to the diagonal neighbours of the current vertex. Each of these four-input sets is OR combined to generate the orthogonal edge and diagonal edge signals. The diagonal edge signal is connected to the diagonal delay DFF resulting in a delay of one node clock cycle. This DFFs output is then OR combined to the orthogonal edge signal and the result is fed to the next delay DFF, delay1. Finally, the output of delay1 is delayed again by DFF delay2. This implements the fact, that a diagonal edge has a weight of 3 while an orthogonal edges weight is 2. Another OR gate feeds the output of delay2 back to its input to keep a high state, regardless of the output of delay1. This feedback allows to implement a preset logic for the reference node. The output of delay2 is masked with the output of the conguration bit by an AND gate. If the output of the DFF which contains the conguration bit is low, this node is marked as an obstacle and the output will stay low, regardless of any incoming signals. The masked signal out is then distributed to the neighbouring nodes inputs and simultaneously used as a low active clock enable signal for the counter present in each node. In contrast to the suggested implementation in [M l99], no global counter is used o in this variant. As long as the output of a node is low, the counter is incremented with each cycle of node clk. To set the conguration bit of a particular node and to set a reference node, a simple address logic is used. Two binary to unary decoders are connected to the upper and lower half of the address register respectively to generate a number of row select and column select signals according to the number of rows and columns of the map. These decoders assure that only one row and column select signal is high active at any given time. Each node of a given row n is connected to the corresponding row select[n] signal, each node of a column m is connected to col select[m].

A node select signal is generated by performing a logical and on a nodes row select and col select inputs. This allows the transfer of a conguration bit to the node currently selected by the address register using the conf data and conf clk inputs common to all nodes. Also, the node select signal and the common conf set signal are fed into a NAND gate. This gates output is connected to the asynchronous preset input of the delay2 DFF, which allows a preset of the currently selected node as a reference node. Additionally, the node select signal is connected to the enable input of counter tristate buffer to output the current nodes potential counter on the counter out bus. All conguration bits can be cleared asynchronously by assigning a low signal to the common conf reset input. The counter and delay DFFs can be reset asynchronously by assigning a low signal to node reset.
24 Universitat Bielefeld, AG Technische Informatik

4 Implementation of a Prototype

Figure 4.4: Block schematic of the address logic

4.3

Generation of the node matrix by a Verilog code generator

The current design uses a matrix of 16 by 16 nodes. Each nodes output is connected to the inputs of each of its eight neighbours. This results in a total number of 65536 connections for the output signals only. Thus, a manual placement of nodes and interconnections is impractical, but the regularity of the connections allows an automatic generation of the node matrix in a hardware description language. The input and output signals of each node can be divided into three different categories: The inputs node reset n, node clk, conf reset n, conf clk, conf set, conf data and
Universitat Bielefeld, AG Technische Informatik 25

4 Implementation of a Prototype

the output bus counter out are common signals to each node and passed through from or to the node matrix. All row select and col select selector input lines of the node matrix are connected to each node in the corresponding row and column. The denition and interconnection of these signals depends on both the size of the node matrix and the position of an individual node in the matrix. The in0, . . . , in7 inputs of each node are connected to the out signals of the eight neighbouring nodes. At the border of the matrix, a low signal is assigned to these inputs if the appropriate neighbour does not exist. The denition of these signals depends on the position and size of the matrix, too. The Verilog source code of the node matrix consists of three parts: A header containing the denitions of the node matrix module, inputs, outputs, and internal signal lines, followed by the repeated instantiation and interconnection of nodes, and a footer ending the module denition. Three template les representing the needed parts are used to generate the resulting le. Placeholder variables, marked by the #-character at the beginning and the end, represent values dependent on size of the grid and position of individual nodes. They are calculated and lled in upon execution of the code generator. The template class supports integer, oat and string variables, though only integer and string variables are used in this application.

26

Universitat Bielefeld, AG Technische Informatik

4 Implementation of a Prototype

module node_matrix(node_row,node_col, node_reset_n,conf_reset_n, conf_clk,conf_set,conf_data,node_clk, counter_out); input input input input input input input input [#maxrow#:0] node_row; [#maxcol#:0] node_col; node_reset_n; conf_reset_n; conf_clk; conf_data; conf_set; node_clk;

output [4:0] counter_out; wire wire wire wire wire wire wire wire node_clk; [#maxrow#:0] node_row; [#maxcol#:0] node_col; node_reset_n; conf_clk; conf_data; conf_set; conf_reset_n;

tri [4:0] counter_out; wire node_out[#maxnode#:0];

Figure 4.5: header.v - Template for the node matrix.v header

Figure 4.5 shows the template for the module header. For a height rows by width columns matrix, the values for the used variables are: #maxrow# = height-1 #maxcol# = width-1 #maxnode# = (height * width) -1 Those variables are used to dene the number of selector signals and output signals of all nodes.
Universitat Bielefeld, AG Technische Informatik 27

4 Implementation of a Prototype

The template for the instantiation of one node is shown in gure 4.6. node node_x#x#_y#y# ( .row_select(node_row[#x#]), .col_select(node_col[#y#]), .conf_clk(conf_clk), .conf_data(conf_data), .conf_set(conf_set), .conf_reset_n(conf_reset_n), .node_clk(node_clk), .node_reset_n(node_reset_n), .in0(#in0#), .in1(#in1#), .in2(#in2#), .in3(#in3#), .in4(#in4#), .in5(#in5#), .in6(#in6#), .in7(#in7#), .out(node_out[#n#]), .counter_out(counter_out) );

Figure 4.6: node.v - Template for the node instantiation in node matrix.v

The nodes are generated row by row and column by column in each row. In a nested loop, this template is used once for each node with #y# running from 0 to #maxrow# and #x# running from 0 to #maxcol#. Therefore, a node at row 5, column 3 gets the name node\_x3\_y5, its row select and col select inputs are connected to row select[5] and col select[3] respectively. The number #n# of a node is calculated as #n# = y*width + x and this nodes output is connected to node out[#n#]. For the above example, in a 16 by 16 matrix the output line is node out[83]. As described in section 4.2, in0, . . . , in7 are the input signals from the neighbouring nodes. Figure 4.7 shows the denition of a lookup table containing the corresponding #x# and #y# offsets for each incoming signal.
28 Universitat Bielefeld, AG Technische Informatik

4 Implementation of a Prototype

struct offset { int dx,dy; }; struct offset edge_offsets[8] = { { -1, 0 }, // left { 1, 0 }, // right { 0, -1 }, // bottom { 0, 1 }, // top { -1, -1 }, // top left { 1, -1 }, // top right { -1, 1 }, // bottom left { 1, 1 } // bottom right };
Figure 4.7: Offset table describing the position of adjacent nodes

For each input inN, the offset edge_offsets[N] is added to the current #x# and #y# to obtain the position of the neighbouring node. In case of the resulting position being out of bounds, inN is substituted by the string b0 which represents a logic zero in Verilog. Otherwise, the replacement string is node_out[x+edge_offsets[N].dx+(y+edge_offsets[N].dy)*width]

After the Verilog code for all nodes is generated, the footer template is appended. This template only contains the line endmodule to end the Verilog module denition.

4.4

Interface to a host system

An interface to the host system is needed to transfer map data to the FPGA, set reference nodes, and retrieve the resulting potential vector. For the prototype, an RS232 connection was chosen for the ease of creating the needed host software. The interface mainly consists of a ready-to-use UART connected to the address and command registers, a command decoder, and a state register, which allow to execute a set of commands received via the RS232.

Universitat Bielefeld, AG Technische Informatik

29

4 Implementation of a Prototype

The RXD and TXD data lines of the UART are connected directly to FPGA pins, which are then connected to the TTL to RS232 voltage level converters. After a byte is received by the UART, the RxRDY signal gets high for one clock cycle and data can be read from the Dout bus. The current data at the Din bus is sent to the RS232 by assigning a high signal to the LD input for one clock cycle. While a byte is transmitted, the TxBusy signal is high. Another status signal is RxErr, which indicates an error during the receipt one byte. This signal is unconnected, since no error handling for the interface is implemented and transmission errors are unlikely at a rate of 9600 bits per second. The UART supports a choice between two transmission rates, dened as parameters, by assigning a high or low signal to the Baud input. Since only a rate of 9600 bits per seconds is used, this signal is tied to GND. RST is a low-active reset signal and connected to the system reset signal. Dout is fed into the data inputs of both the command and address registers, allowing to load received data by assigning a high signal to one of the registers enable inputs for one clock cycle. counter out of the node matrix is connected to Din of the UART. This tri-state bus holds the current counter value of the node selected by the address register. The state register is used to distinguish between the wait for command state and the actual execution of an instruction. As long as it is low, the system waits to receive a byte via the UART. After uart rxrdy switches to a high value, the decoder performs a fetch command cycle by assigning a high to command enable which results in the command register to take over and store the value of the UARTs Dout bus. The msr is set to high, too, and the execution of the particular instruction starts in the next system clock cycle. Table 4.1 shows the decoder table for the instruction decoder. The actual encoding of each command can be taken from the column command register. The following commands are implemented: ADDRESS <n> Two states are needed for the ADDRESS instruction to load a node address to the address register. First, the system stays in the load address wait state until a second byte is received via the UART. After another transition of uart rxrdy to a high value, the decoder assigns a high signal to address enable and the value of Dout is taken over by the address register in the load address load cycle. READ The READ instruction is implemented in two states, too. As long as the uart txbusy signal is in a high state, the system will stay in this state to wait for an ongoing transmission of a byte to nish. When the UART is ready to transmit
30 Universitat Bielefeld, AG Technische Informatik

4 Implementation of a Prototype

command register

command enable

address enable

node reset n

uart txbusy

conf reset n

uart rxrdy

node clk

conf clk

conf set

conf data 0 0 0 0 0 0 0 0 0 b

state wait for command fetch command load address wait load address load read wait read send reset nodes reset map iterate load cong

0 0 1 1 1 1 1 1 1 1

x x b00000001 b00000001 b00000010 b00000010 b00000100 b00001000 b00010000 bab100000

0 1 0 1 x x x x x x

x x x x 1 0 x x x x

0 0 0 1 0 0 0 0 0 0

0 1 0 0 0 0 0 0 0 0

uart ld

0 0 0 0 0 1 0 0 0 0

0 0 0 0 0 0 0 0 1 0

1 1 1 1 1 1 0 1 1 1

0 0 0 0 0 0 0 0 0 1

1 1 1 1 1 1 1 0 1 1

0 0 0 0 0 0 0 0 0 a

Table 4.1: Decoder table of the instruction decoder

new data, a high signal is assigned to the LD input in the read send state. This triggers a transmission of the currently assigned value of the Din port to the host system via the RS232. RESET NODES The RESET NODES instruction causes the node reset n output of the decoder to become low for one clock cycle. This resets the counter and delay registers in the node matrix. RESET MAP This instruction works similar to the RESET NODES command. By assigning a low value to conf reset n for one cycle, the map conguration registers in the node matrix are reset. ITERATE The ITERATE instruction causes a high value on the node clk output of the decoder, which results in a clock cycle for both the counter and delay registers of the node matrix. This is used to perform iterations of the implemented algorithm. LOAD CONFIG LOAD CONFIG is used to set the currently selected node as a reference or an obstacle node. Unlike the other instructions, this is the only one which encodes
Universitat Bielefeld, AG Technische Informatik 31

msr out 0 1 1 0 1 0 0 0 0 0

msr in

4 Implementation of a Prototype

additional parameters in the instruction word. These parameters, a and b in table 4.1, are passed through to the node matrix. If parameter a, the conf set signal, is set, the currently selected node is set as a reference node by asynchronously setting the last delay element to a high value. Parameter b is passed to the conf data signal and a simultaneous conf clk cycle causes the conguration DFF of the node to take over this signal. Since the output of the conguration bit is AND combined with the output of the last delay DFF, a low value has to be set for an obstacle node and a high signal for a clear node. The chosen encoding uses one bit of the instruction word to distinguish between different instructions. This keeps an implementation of the decoder in gate logic simple. Figure 4.8 shows the implementation of the decoder.

Figure 4.8: Schematic of the instruction decoder

4.5

Software environment on the host system

On the host system, a set of command line utilities written in C is used for test runs. The source code for a module containing common functions used in these utilities can
32 Universitat Bielefeld, AG Technische Informatik

4 Implementation of a Prototype

be found in appendix B.2 The utilities include: send <value> This program just sends the value supplied as an argument to the serial port. It is used mainly for testing purposes, e.g. for sending a single command byte to the FPGA. Example: $ ./send 4 # issue "reset_nodes" command to the FPGA address <node address> address sends the ADDRESS command, followed by the node address parameter. It is used for the debugging of the address logic and seven segment decoders, and for setting one or more reference nodes. Example: $ ./address 50 # 0x31 - set address to node (3,1) $ ./send 14 # 0xE0 - set node (3,1) as # clear reference node map transfer <map lename> This utility reads in an ASCII le <map lename>, resets the map conguration in the FPGA, and transmits the neccessary commands to congure the map according to the contents of the le. The rst two lines of the le contain the nummeric values of the width and height of the map. In the following lines, a space represents a clear node and any non-space character marks an obstacle. Figure 4.9 shows an example le. iterate <n> This commands sends <n> ITERATE instructions to the FPGA to perform <n> iterations of the algorithm. read To retrieve the potential vector from the FPGA, the read utility is used. It uses ADDRESS and READ NODE instructions to read every nodes potential and prints it to the console.
Universitat Bielefeld, AG Technische Informatik 33

4 Implementation of a Prototype

16 16 **************** * * * * * * * * * * * ********* **** * * * * * * * * * * * ************ *** * * **** * ******** * * * * * * ****************


Figure 4.9: Example ASCII map for the transfer map utility

4.6

Result of a test run

In a test run, the map shown in gure 4.10 encoded as an ASCII map as shown in gure 4.9 is transfered to the FPGA. An ADDRESS instruction is then sent to the FPGA to set the address register to node (1,1) and a following LOAD CONFIG instruction sets this node as the reference node: $ ./address 17 $ ./send 14 # 0x11 - set current node address to (1,1) # 0xE0 - set current node as # reference node

Figure 4.11 shows the output of the read utility after 71 iterations. These values are loaded into Matlab to generate a visualisation of the resulting potential vector in a greyscale image as shown in gure 4.12.

34

Universitat Bielefeld, AG Technische Informatik

4 Implementation of a Prototype

Figure 4.10: Map for the test run

71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71

71 0 2 4 6 71 40 41 42 43 71 59 71 61 62 71

71 2 3 5 7 71 38 39 40 41 71 57 71 59 60 71

71 4 5 6 8 71 36 37 38 39 71 55 71 57 58 71

71 6 7 8 9 71 34 35 36 37 71 53 54 55 57 71

71 8 9 10 11 71 32 33 34 35 71 51 52 54 56 71

71 10 11 12 13 71 30 31 32 34 71 49 71 71 71 71

71 12 13 14 15 71 28 29 31 33 71 47 48 50 52 71

71 14 71 71 71 71 26 71 71 71 71 45 71 51 53 71

71 16 17 19 21 23 25 27 29 31 71 43 71 53 54 71

71 18 19 20 22 24 26 28 30 32 71 41 71 55 56 71

71 20 21 22 23 25 27 29 31 33 71 39 71 57 58 71

71 22 23 24 25 71 28 30 32 34 36 38 71 59 60 71

71 24 25 26 27 71 30 31 33 35 71 39 71 61 62 71

71 26 27 28 29 71 32 33 34 36 71 41 71 63 64 71

71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71

Figure 4.11: Output of the read utility after 71 iterations

Universitat Bielefeld, AG Technische Informatik

35

4 Implementation of a Prototype

Figure 4.12: Greyscale image of the resulting potential vector.

X X X X X X X X X X X X X X X X

X t X

X X X X X

X X X X X X

X X

X X

X X X X X

X X X X X X X X

X X X X X X

X X X

X X X X X X X X X X X X X X X X X X X X X X X X X X X X

Figure 4.13: Direction vectors generated by a path tracing algorithm. By following these vectors from any point in the map, the reference node t is reached via the shortest path possible.

36

Universitat Bielefeld, AG Technische Informatik

5
Analysation and Discussion
The FPGA used in this application supplies a matrix of 30 by 40 LABs, containing 10 LEs each. To achieve an optimal mapping of the graph structure onto the FPGAs LAB grid, the implementation of one graph node in one LAB is desirable. However, on the chosen Stratix 1S10 FPGA two main problems occur: With an increase of the map size, distanced may become longer. This leads to a higher maximal counter value, and the number of DFFs needed to store node potentials grows. No real tri-state signals are supported, so additional logic is required to retrieve the potentials from the FPGA. The following sections discuss these problems.

5.1

Substitution of DFFs by LUT based delay elements

Figure 5.1, taken from the Stratix Device Handbook [Cor04b], shows the internal structure of a LE within a LAB.

Universitat Bielefeld, AG Technische Informatik

37

5 Analysation and Discussion

Figure 5.1: Stratix LE block schematic taken from [Cor04b]

An approach to conserve registers of LEs for storing larger counter values is the replacement of delay DFFs with a LUT based solution. The following requirements have to be met by the replacement: To put the system in a dened state prior to a computation, a reset signal is needed. An input signal to the element must be propagated on a clock edge. An additional asynchronous preset logic is needed in the last delay element to set a reference node. The rst step in the creation of a LUT based register is the implementation of an asynchronous RS ip-op as shown in gure 5.2.

Figure 5.2: Equivalent block schematic for a LUT based RS ip-op

38

Universitat Bielefeld, AG Technische Informatik

5 Analysation and Discussion

The set input is connected to an OR gate. The output of this gate is then masked with the low-active reset n signal and fed back to the second input of itself. As long as the reset n signal stays in a high state, this feedback loop causes the output to stay in a high state, too, once a high signal is assigned to the set input. A low reset n signal causes the output to become low. As long as the input signal also stays low, the ip-op will remain in this state. To use this LUT based ip-op in graph nodes, an extension to create an edge triggered behaviour is needed. In theory, this can be achieved by masking the input signal with the negated clock signal in a rst LUT and then masking the output of the rst LUT with the same clock signal. When the input of the rst LUT switches to high, its output also switches to high during a low clock signal. At the edge of the clock signal, the output of the rst LUT stays high for the propagation delay of the LUT. In this period of time, the second LUT already received the high clock signal without the delay and the LUT output is expected to switch to and stay in a high state until a low reset n signal is assigned.

Figure 5.3: Equivalent block schematic for a LUT based delay element

Figure 5.3 shows the complete schematic for the LUT based delay element. The leftmost AND gate and the inverter are implemented in the rst LUT, the rest of this circuit is implemented in the second LUT. Tests of this circuit in the Quartus II simulator produce contradicting results as shown in gures 5.4 and 5.5. While the functional netlist simulation reports the predicted results, the output signal oscillates in an unstable state in the timing simulation. A test on the FPGA hardware showed the predicted behaviour to be true, so an implementation of a delay element in this way is possible.

Figure 5.4: Results of a functional netlist simulation of the delay element.

Universitat Bielefeld, AG Technische Informatik

39

5 Analysation and Discussion

Figure 5.5: Results of a timing simulation of the delay element

Since the development environment usually optimises a users design to save on-chip ressources, another problem arises during the implementation of this circuit. Every network of logic gates is mapped onto 4-input LUTs, so suitable parts of a circuit are reduced to 4-input functions to t into those LUTs. For example, the RS ip-op in gure 5.2 is synthesized into a single LUT. In the proposed schematic of the clock edge triggered delay element, the optimiser identies both leftmost AND gates as redundant. This part of the schematic can also be written as the equation (a b) b which evaluates to false for all values of a and b, if no propagation delays are taken into account. To solve this problem, the above circuit was created by using a method to directly instantiate logic elements on the FPGA as described in the next section.

5.2

Low-level instantiation of logic elements by VQM source les

To allow design and synthesis of FPGA designs in 3rd party tools without the publication of the bitstream format, Quartus II offers the possibility to include Verilog Quartus Mapping (VQM) les in a project and assemble them. These les make it possible to instantiate and congure FPGA ressources like logic cells (LEs), I/O blocks, on-chip memory, etc. on a low level. As the name suggests, the syntax of VQM les is the same as that of Verilog les. Unfortunately, no detailed public information about VQM les is available, but the Quartus II software can create a VQM le of a compiled project. Therefore, reverse engineering the le format is possible by compiling simple projects and analysing the resulting VQM les.

5.2.1

Structure of VQM les

As an example, gure 5.6 shows the resulting VQM le of the RS ip-op design from the last section. The most interesting part of this le is the instantiation of a logic cell to implement the given logical function. Since the design of the ip-op uses only three inputs overall, it was reduced to a single LE using only one LUT.

40

Universitat Bielefeld, AG Technische Informatik

5 Analysation and Discussion

stratix_lcell <name> (<ports>); creates a logic element named <name> with connections to other logic listed in <ports>. Conguration parameters of the logic element are dened by defparam. Two operation modes are possible for a Stratix LE: normal mode and dynamic arithmetic mode. The Stratix Device Handbook[Cor04b] describes both modes in detail.

The following input and output ports can be dened in a VQM le: dataa, datab, datac, datad These are the four data inputs of the LUT. datac can be congured as the cin carry input signal or as an input for a feedback from the register of the same LE alternately. cin This is the carry chain input signal of a LE. It can only be connected to the cout carry out output of a previous LE in the same LAB. regcascin This is the register cascade input of the LE. To connect the input of the LEs register directly to the output of the register in the previous LE inside a LAB, the regcascin input is used. aclr The aclr input of a LE can be connected to a LAB wide asynchronous highactive reset signal. Only two different signals of this type are available to all LEs in one LAB. aload This is a LAB wide asynchronous preset signal. One signal of thise type common to all LEs inside a LAB can be used. When a high signal is assigned to this input, the register loads the current value of the datac port. clk, ena These are the clock and clock enable inputs of the register in a LE. Both are LAB wide signals, too. There are two clock and two clock enable signals available for each LAB.
Universitat Bielefeld, AG Technische Informatik 41

5 Analysation and Discussion

module

rs_ff (reset_n,set, q);

input reset_n; input set; output q; wire \reset_ncombout ; wire \setcombout ; wire \inst10 ; wire gnd; wire vcc; assign gnd = 1b0; assign vcc = 1b1; stratix_io \reset_nI (.combout(\reset_ncombout ),.padio(reset_n)); defparam \reset_nI .operation_mode = "input"; defparam \reset_nI .ddio_mode = "none"; defparam \reset_nI .input_register_mode = "none"; defparam \reset_nI .output_register_mode = "none"; defparam \reset_nI .oe_register_mode = "none"; defparam \reset_nI .input_async_reset = "none"; defparam \reset_nI .output_async_reset = "none"; defparam \reset_nI .oe_async_reset = "none"; defparam \reset_nI .input_sync_reset = "none"; defparam \reset_nI .output_sync_reset = "none"; defparam \reset_nI .oe_sync_reset = "none"; defparam \reset_nI .input_power_up = "low"; defparam \reset_nI .output_power_up = "low"; defparam \reset_nI .oe_power_up = "low"; stratix_io \setI (.combout(\setcombout ),.padio(set)); // parameters removed, see above stratix_lcell \inst120 (.datab(\reset_ncombout ),.datac(\setcombout ), .datad(\inst10 ),.combout(\inst10 )); defparam \inst120 .operation_mode = "normal"; defparam \inst120 .synch_mode = "off"; defparam \inst120 .register_cascade_mode = "off"; defparam \inst120 .sum_lutc_input = "datac"; defparam \inst120 .lut_mask = "CCC0"; defparam \inst120 .output_mode = "comb_only"; stratix_io \qI (.datain(\inst10 ),.padio(q)); defparam \qI .operation_mode = "output"; // rest of parameters removed, see above endmodule

Figure 5.6: VQM le generated by the Quartus II software for the LUT RS ip-op shown in Figure 5.2

42

Universitat Bielefeld, AG Technische Informatik

5 Analysation and Discussion

sclr This signal is a synchronous LAB wide reset input for the register. One sclr signal can be used per LAB. sload sload is the synchronous counterpart of the aload input. When this LAB wide signal is set, the register stores the value of the datac input on the next clock edge. combout The combout output is the output of the LEs LUT. regout This is the output of the LEs register. cout The cout output is the carry chain output of a LE. It is available in arithmetic mode and can only be connected to the cin port of the next LE in the same LAB. To congure the behaviour of a LE, defparam is used to set required parameters. The availability of some of the above ports also depends on these parameters. operation_mode The value of this parameter can be set to normal or arithmetic. In arithmetic mode, the LUT is split into four 2-input LUTs. Also, the cout output becomes available in this mode. However, the implementation of delay elements only uses LEs in normal mode - a detailed description of the dynamic arithmetic mode is found in the Stratix Device Handbook [Cor04b]. lut_mask The lut_mask is a 16 bit hexadecimal value that represents the truth table of the LUT. The Quartus II Handbook, Volume 3, Section 10 [Cor04a] describes how to calculate the LUT mask for a given equation by the creating of a truth table. The LUT can be compared to a simple 16x1 RAM: with dataa, datab, datac, and datad used as address lines 0 to 3, the LUT mask resembles the stored data word with the least signicant bit rst. register_cascade_mode Possible values for this parameter are on and off. When this mode is enabled, the output of the previous LE in the LAB can be connected directly to the input of the register of the LE that is to be congured by using the regcascin input.
Universitat Bielefeld, AG Technische Informatik 43

5 Analysation and Discussion

sum_lutc_input This parameter determines the signal which is fed into the LUTs datac input. A value of datac connects the LUTs input to the datac input of the LE. The value cin connects the input to the carry in input, and the value qfbk can be used to connect the output of the register in this LE to the LUTs datac input. synch_mode Like register_cascade_mode, a boolean value of on or off can be assigned to this parameter. It has to be set to use the synchronous input ports of the LE. output_mode The LE has three output modes: The LUT or the register output or both can be made available for interconnections. The possible values are comb_only for the LUT output only, reg_only for the register output and reg_and_comb for both. Figure 5.6 also shows the denition of stratix_io input and output pins. Since it is possible to include VQM les in a Quartus II project, the input and output pin assignments can be created by standard tools. Therefore, an implementation of the input and output blocks in VQM is not neccessary. As a side note, while trying to review the VQM le from a more complex project which contained the Nios system-on-a-chip CPU of the Nios Development Kit, the resulting le was no ASCII le. Instead, it contained binary data. It seems that VQM les that contain elements from the Quartus II SOPC Builder are encrypted somehow. A possible reason for this may be the prevention of compiling this CPU for other FPGA architectures by reimplementing the basic Stratix components in Verilog.

44

Universitat Bielefeld, AG Technische Informatik

5 Analysation and Discussion

5.2.2

Implementation of the LUT delay element in VQM

As previously mentioned, the LUT delay element utilises two LUTs.

Figure 5.7: Schematic for a LUT based delay element

Figure 5.7 shows the neccessary connections. The logical function for the LUT labeled stage1 is a b. The equivalent truth table for this function is shown in table 5.1. datab 0 0 1 1 dataa out 0 0 1 1 0 0 1 0

Table 5.1: Truth table for the stage1 LUT

Since the output does not depend on inputs c and d, the output pattern is repeated four times, resulting in a binary LUT mask of 0010001000100010 equal to a hexadecimal value of 2222. According to gure 5.3, the function for the LUT labeled stage2 is ((a b) d) c). Table 5.2 shows the resulting truth table. The resulting LUT mask for this function is a hexadecimal value of F080. Both LUTs can be implemented in VQM les as shown in gure 5.8 and added to a Quartus II project. As soon as the source les are added, the implemented modules can be used in the project like any other component dened in Verilog. In the test environment for the delay element, a symbol was generated for both stage1 and stage2 and they were instantiated in a block schematic le.

Universitat Bielefeld, AG Technische Informatik

45

5 Analysation and Discussion

datad 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

datac datab 0 0 0 0 0 1 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 1 0 1 1 1 1

dataa out 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 1 1 0 1 1 1

Table 5.2: Truth table for the stage2 LUT

5.3

Reduction of the number of needed DFFs by using an alternate path tracing algorithm

In a regular grid, the largest potential difference between two adjacent nodes is the weight of a diagonal edge. If the application needs only one reference node to be set, the shortest path in the map can still be found if a smaller counter is used, as long as the counter can hold a value of at least twice the maximum distance between two adjacent nodes. In longer paths, a counter overrun occurs, but a modication of the path tracing algorithm still allows to nd the shortest path. However, the information about the absolute distance to the reference node in the potential vector is lost. The algorithm described in section 2.2 uses a maximum weight of three. If three bit counters are used, an overrun occurs if the value changes to a number larger than seven. This results in a substraction of eight from the previous value of a counter. When this happens, the minimal difference between two nodes potentials ve in this case is the maximum counter value increased by one minus the largest possible weight of any edge. Thus, in the tracing algorithm, a counter overrun can be assumed to have occurred whenever the difference between the counters of two nodes is equal or larger than ve.
46 Universitat Bielefeld, AG Technische Informatik

5 Analysation and Discussion

module

stage1 (in, clk, out);

input in; input clk; output out; wire in; wire clk; wire out; stratix_lcell stage1_lut (.dataa(in),.datab(clk),.combout(out)); defparam stage1_lut .operation_mode = "normal"; defparam stage1_lut .synch_mode = "off"; defparam stage1_lut .register_cascade_mode = "off"; defparam stage1_lut .sum_lutc_input = "datac"; defparam stage1_lut .lut_mask = "2222"; defparam stage1_lut .output_mode = "comb_only"; endmodule module stage2 (in, clk, reset_n, out);

input in; input clk; input reset_n; output out; wire wire wire wire in; clk; reset_n; out;

stratix_lcell stage2_lut (.dataa(in),.datab(clk),.datac(reset_n), .datad(out),.combout(out)); defparam stage2_lut .operation_mode = "normal"; defparam stage2_lut .synch_mode = "off"; defparam stage2_lut .register_cascade_mode = "off"; defparam stage2_lut .sum_lutc_input = "datac"; defparam stage2_lut .lut_mask = "F080"; defparam stage2_lut .output_mode = "comb_only"; endmodule

Figure 5.8: Implementation of two LUTs for a delay element

Universitat Bielefeld, AG Technische Informatik

47

5 Analysation and Discussion

The pseudo C code in gure 5.9 implements an iteration step of the modied tracing algorithm.
// offsets to neighbours int dx[]={-1, 0, 1,-1, 1,-1, 0, 1}; int dy[]={-1,-1,-1, 0, 0, 1, 1, 1}; int num_neighbours = 8 int min_overrun_diff = 5; int overrun_value = 8; // p[][] contains a two dimensional array // of counter values int max_dist=0; int next_index=-1; // for each neighbour... for(i=0;i<num_neighbours;i++) { // calculate the difference in p int dist = p[x][y] - p [x+dx[i]] [y+dy[i]]; // compensate overruns if(dist <= -min_overrun_diff) dist+=overrun_value; if(dist >= min_overrun_diff) dist-=overrun_value; if(dist > max_dist) { max_dist = dist; next_index = i; } } if(next_index >= 0) { // neighbour with lower potential found x+= dx[next_index]; y+= dy[next_index]; } else { // reference node reached }

// save new maximum distance // save index of offset

Figure 5.9: Pseudo C code for the modied tracing algorithm

5.4

Implementation of the address logic and host interface

The second problem of an efcient implementation of this algorithm involves the transfer of calculated potential data to the host. In the original implementation of the prototype, tri-state buffers supplied by the Quartus II development environment have been used to access the contents of each counter register. Unfortunately, the analysis of these
48 Universitat Bielefeld, AG Technische Informatik

5 Analysation and Discussion

buffers reveals, that they are actually implemented in gate logic, since the used FPGA does not support real tri-state signals.

Figure 5.10: AND-OR replacement logic for tri-state buffers

Figure 5.10 shows an implementation for a substitution of those tri-state buffers using gate logic. chain_in is connected to the output of chain_out of the previous node. The data input is connected to the output of the data signal to be tri-stated. This signal is masked with the select signal by an AND gate, the result is OR combined with the result of the previous nodes and passed on to the next node. Since only one select signal of the whole chain can be active at a time, the masked output of all inactive parts is low. Only the data input of the selected node determines the result of the whole chain.// // On an FPGA architecture with 4-input LUTs, an implementation of this logic ts into one LUT. Even an extension for the combination of row and column select signals by an AND gate can be implemented in a single LUT, but this method is not desired in an implementation with a large node matrix, since a long chain of LUTs would result in a large propagation delay. Instead, the chain can be broken down in one chain for each row. The values of the selected column are then combined within each row, and an additional row select logic provides the nal result. This structure still leaves one input port of a LUT vacant. If this port is congured as an additional input to the OR gate, further reduction of the propagation delay is possible by arranging the elements of the chain in a tree-like structure. Figure 5.11 shows the resulting selection logic for three data bits.

Universitat Bielefeld, AG Technische Informatik

49

5 Analysation and Discussion

Figure 5.11: Efcient selection logic for three data bits

Another desired improvement regarding the host interface is a faster transfer of the map and the result of the calculation. The prototype uses a RS232 connection at 9600 bits per second, and three bytes have to be transfered to congure one node of the map. This results in a duration of 2.5 milliseconds per node. In a map of 16 by 16 nodes, an approximation of the maximum potential value for a worst case scenario is 225 (see section 2.2). With the FPGA running at 50 MHz, the calculation nishes after 4.5 microseconds. Therefore, in a real-world application, a PCI interface could result in a great acceleration. By operating the FPGA synchronously to the PCI bus at 33 MHz or 66 MHz and a using a parallel transfer of 32 or 64 conguration bits in a burst transfer 1 , the overhead of conguration can be reduced signicantly.

5.5

Conclusion

A parallel implementation of the algorithm described in 2.2 on an FPGA is possible. However, the mapping of a graph node onto a single LAB could not be realised, because the additional logic to implement the counter retrieval mechanism increased the number of logic elements needed. Thus, a topology preserving mapping of the graph onto the FPGA would only be possible if a second adjacent LAB is used to implement the complete graph node. This would reduce the maximum map size by 50%. Some tests with different map and counter sizes have been made, and the results demonstrated that the Quartus II development environment is already able to generate a ressource efcient version of this project, though the automatic placement of nodes could not be predicted. Figure 5.12 shows the connections of the incoming edge signals to node (13,3) in a map sized 16 by 16. While a two LAB version of the algorithm uses 20 LEs per node, the
In a PCI burst transfer, one data word of a contiguous memory block can be transfered per clock cycle.
1

50

Universitat Bielefeld, AG Technische Informatik

5 Analysation and Discussion

Size of the node matrix 16x16 24x24 24x24

Nodes total 256 576 576

Counter width 5 bits 5 bits 3 bits

LEs used 4118 9703 7779

LEs/node used 16.1 16.8 13.5

Table 5.3: Map size vs. LEs/node used

synthesiser of the Quartus II software needed less than that number. Table 5.3 shows the needed LEs per node for three different versions of the prototype.

Figure 5.12: Connections of the edge signals to map node (13,3) in a 16 by 16 map

To overcome the problems regarding the retrieval logic, the choice of a different newer generation FPGA appears feasible. For example, the Xilinx Virtex II FPGA supports tri-state buses, however the available number of these signals as described in [Cor03b] is still too limited for this application. Also, the newer Altera Stratix II FPGAs use improved versions of the LAB as basic logic blocks, described in the Stratix II Device
Universitat Bielefeld, AG Technische Informatik 51

5 Analysation and Discussion

Handbook [Cor04c]. These LABs contain eight adaptive logic modules (ALMs) which not only are larger, but a more exible conguration is possible. There are two registers per ALM, and with eight data inputs per ALM altogether, the conguration of two 4input LUTs in one ALM is not the only option. It is possible to congure one 3-input LUT and one 5-input LUT or to use one large 6-input LUT and connecting the remaining two inputs directly to the registers. Some other congurations with different constraints are possible, too. The design using the Stratix LEs utilises a full LUT for combining the already OR combined and delayed diagonal edge signal with the output signal of the OR gate which combines the orthogonal edge signals. Furthermore, the remaining registers in those LEs are hardly accessible. In an ALM, the combination of all four orthogonal edge signal and the delayed diagonal edge signal would be possible in a single 5-input LUT congured as an OR gate while keeping both registers of the ALM available.

52

Universitat Bielefeld, AG Technische Informatik

A
Large format schematics
A.1 Schematic of the top level entity
On the following page, a fold-out of the top level entity of the FPGA design can be found. Unfortunately, the picture for this schematic was taken from an early version of the project and the counter out output of the node matrix is shown with a wrong bus width. While the picture shows an eight bit wide bus, the running prototype uses only a ve bit wide bus. The correct label for this output is counter_out[4..0].

Universitat Bielefeld, AG Technische Informatik

53

A Large format schematics

A.2

Schematic of a graph vertex implemented in digital hardware

54

Universitat Bielefeld, AG Technische Informatik

B
Source code
B.1 main.m - Main module of the Verilog code generator
/* main.m of the verilog code generator */ #import "template.h" struct offset { int dx,dy; }; struct offset edge_offsets[8] = { { -1, 0 }, // left { 1, 0 }, // right { 0, 1 }, // bottom { 0, -1 }, // top { -1, -1 }, // top left { 1, -1 }, // top right { -1, 1 }, // bottom left { 1, 1 } // bottom right }; int main(int argc, char **argv) { int width=16;
Universitat Bielefeld, AG Technische Informatik 55

B Source code

int height=16; unsigned char *line; int x,y,n,i,j; int numnodes=width*height; unsigned char chain_in_buffer[128]; unsigned char edge_buffer[8][128]; unsigned char node_edge_buffer[128]; id header,node,footer; header=[Template new]; footer=[Template new]; [header [header [header [header [header load: "templates/header.v"]; set: "maxnode" toInt: numnodes-1]; set: "maxcol" toInt: width-1]; set: "maxrow" toInt: height-1]; show];

for(y=0;y<height;y++) { for(x=0;x<width;x++) { n=width*y+x; node=[Template new]; [node load: "templates/node.v"]; [node set: "n" toInt: n]; [node set: "x" toInt: x]; [node set: "y" toInt: y]; for(i=0;i<8;i++) { if((x+edge_offsets[i].dx<0)|| (y+edge_offsets[i].dy<0)|| (x+edge_offsets[i].dx>=width)|| (y+edge_offsets[i].dy>=height)) { sprintf(edge_buffer[i],"b0"); } else { sprintf(edge_buffer[i],"node_out[%d]", x+edge_offsets[i].dx + (y+edge_offsets[i].dy)*width ); } sprintf(node_edge_buffer,"in%d",i); [node set: node_edge_buffer
56 Universitat Bielefeld, AG Technische Informatik

B Source code

toString: edge_buffer[i]]; } [node show]; } } [footer load: "templates/footer.v"]; [footer show]; }

Universitat Bielefeld, AG Technische Informatik

57

B Source code

B.2

serial.c - Support functions for the host utilities

/* serial.c - module containing serial port * initialisation and helper functions. */ #include #include #include #include #include #include #include <stdio.h> <string.h> <unistd.h> <fcntl.h> <errno.h> <termios.h> <sys/select.h>

#include "serial.h" #define SERIAL "/dev/tty.USA19QW1b1P1.1" int serial_fd; struct termios options; fd_set read_fds; struct timeval timeout;

int init_serial() { serial_fd=open(SERIAL,O_RDWR | O_NOCTTY | O_NDELAY); if(serial_fd==-1) { perror("cannot open serial port!\n"); exit(0); } fcntl(serial_fd,F_SETFL,0); tcgetattr(serial_fd,&options); cfsetispeed(&options, B9600); cfsetospeed(&options, B9600); options.c_cflag &= PARENB;
58 Universitat Bielefeld, AG Technische Informatik

B Source code

options.c_cflag options.c_cflag options.c_cflag options.c_cflag

&= &= &= |=

CSTOPB; CSIZE; CRTSCTS; ( CS8 | CREAD | CLOCAL );

options.c_lflag &= (ICANON | ECHO | ECHOE | ISIG); options.c_iflag &= (IXON | IXOFF | IXANY); options.c_oflag &= OPOST; tcsetattr(serial_fd, TCSAFLUSH, &options); } void close_serial() { close(serial_fd); } void send_char(unsigned char c) { write(serial_fd,&c,1); } void send_string(unsigned char *s, int length) { write(serial_fd,s,length); } unsigned char read_char() { unsigned char buf; read(serial_fd,&buf,1); return buf; } // fpga commands void reset_map() { send_char(0x08); } void reset_nodes() { send_char(0x04); }

Universitat Bielefeld, AG Technische Informatik

59

B Source code

unsigned char read_node(unsigned char addr) { send_char(0x01); send_char(addr); send_char(0x02); return read_char(); } void write_node(unsigned char addr, int flag) { send_char(0x01); send_char(addr); if(flag) { send_char(0x60); } else { send_char(0x40); } } void activate_node(unsigned char addr) { send_char(0x01); send_char(addr); send_char(0xE0); } void iterate() { send_char(0x10); }

60

Universitat Bielefeld, AG Technische Informatik

B Source code

B.3

map transfer.c - Host utility to transfer a map to the FPGA

/* map_transfer.c - Transfers an ASCII map file * to the FPGA */ #include <stdio.h> #include "serial.h" int main(int argc, char **argv) { FILE *map; char line_buffer[256]; unsigned char addr; int width,height,x,y; if(argc!=2) { printf("usage: %s <mapfile>\n",argv[0]); exit(10); } if(!(map=fopen(argv[1],"r"))) { printf("cant open mapfile!\n"); exit(10); } fgets(line_buffer,256,map); width=atoi(line_buffer); fgets(line_buffer,256,map); height=atoi(line_buffer); init_serial(); reset_nodes(); reset_map(); for(y=0;y<height;y++) { fgets(line_buffer,256,map); for(x=0;x<width;x++) { addr=(x&0x0f)+16*(y&0x0f); if(line_buffer[x]!= ) {
Universitat Bielefeld, AG Technische Informatik

61

B Source code

printf("**"); } else { write_node(addr,1); printf(" "); } } printf("\n"); } close_serial(); }

62

Universitat Bielefeld, AG Technische Informatik

Bibliography

Bibliography
[ALS03] Bertrand Cuzeau ALSE. Simple UART project - VHDL. http://www.alsefr.com, 2003. [Cor99] Xilinx Corporation. XC4000 databook. http://www.xilinx.com, 1999. [Cor03a] Altera Corporation. Nios development board. http://www.altera.com, 2003. [Cor03b] Xilinx Corporation. Virtex-II platform FPGAs: Complete data sheet. http://www.xilinx.com, 2003. [Cor04a] Altera Corporation. Quartus II device handbook. http://www.altera.com, 2004. [Cor04b] Altera Corporation. Stratix device handbook. http://www.altera.com, 2004. [Cor04c] Altera Corporation. Stratix II device handbook. http://www.altera.com, 2004. [Min78] E. Minieka. Optimization Algorithms for Networks and Graphs. Dekker, New York, Basel, 1978. [M l99] Ralf M ller. Path planning using hardware time delays. IEEE Transactions o o on Robotics and Automation, Volume 15 No. 3:pp. 588591, 1999. [Nes03] John A. Nestor. FPGA Implementation of a Mulitlayer Maze Router. Department of Electrical and Computer Engineering, Lafayette College, Easton, 2003. [TS01] Matti Tommiska and Jorma Skytt . Dijkstras shortest path routing algorithm a in recongurable hardware. LNCS 2147, pages pp. 653657, 2001.

Universitat Bielefeld, AG Technische Informatik

63

Bibliography

64

Universitat Bielefeld, AG Technische Informatik

Bibliography

Hiermit versichere ich, da ich diese Diplomarbeit selbst ndig bearbeitet habe. Ich habe a keine anderen als die angegebenen Quellen und Hilfsmittel benutzt und entsprechende Zitate kenntlich gemacht. Bielefeld, den November 11, 2004

Marcus Grieger

Universitat Bielefeld, AG Technische Informatik

65