
Departement Elektrotechnik, Professur für Technische Informatik, Professor Dr.

Lothar Thiele

Matthias Dyer Marco Wirz

Reconfigurable System on FPGA


Diploma Thesis DA-2002.14, Winter Term 2001/2002. Tutors: Ch. Plessl, H. Walder

Supervisor: Prof. Dr. Lothar Thiele

Institut für Technische Informatik und Kommunikationsnetze / Computer Engineering and Networks Laboratory

Abstract
FPGAs keep getting larger and faster. They have reached a level where a whole 32-bit CPU fits into a single FPGA without even filling it, so FPGAs can house quite large logic circuits. Another development branch leads to dynamically reconfigurable FPGAs: certain areas within the FPGA can be reconfigured while the rest continues to run unaffected. The next step is to combine these two abilities. In this thesis we show how we implemented a CPU on an FPGA and combined it with additional cores which can be dynamically exchanged while the CPU continues to run unaffected. We use a flow which allows us to implement the entire design with mainstream synthesis tools. We explain the steps it took to build the whole CPU on the FPGA, to add a network card to transport data to the FPGA, and to get two different audio codecs to work. These audio codecs are the dynamic units on the FPGA; they are replaced on demand through dynamic reconfiguration. We also describe the installation of the operating system that runs on the CPU, including the development of the necessary network driver and the application program. We then present the techniques used to create bitstreams for partial reconfiguration using the JBits SDK, and the difficulties that arise from the lack of routing constraints in FPGA implementation tools.


Preamble
In summer 2001, when we decided to do this project as our diploma thesis, both of us could hardly program anything in VHDL. We had done an introductory course in VLSI, but that was about it. Neither did we have much knowledge about the internals of an FPGA. Of course we had heard about this topic in different lectures here at the Swiss Federal Institute of Technology, and we had occasionally played with FPGAs, but only with graphical design tools. In short, we didn't really know what was awaiting us. So in late October, when we finally started, we first had to read a great deal about all these tools, but we actually got LEON to run within two weeks after the start. Although this was mainly because LEON had already been configured for the development board we were using, and we were by far not the first ones to try this, it was good for our motivation to go on to the harder parts. Writing a PCM codec was one of the simpler tasks; we collected our first experiences with VHDL, and soon we got into it. For the network card, we found a project from the University of Queensland, Australia, which implemented a complete IP stack in VHDL. But the network interface in there seemed quite complicated, so we tried to do it in a simpler way, and we can say that we at least partially succeeded. Our solution is surely not as simple as it could be and not yet finished at all, but it is an intuitive design, and the implemented part works. It took us quite a while to implement the whole card, and it was the day before the Christmas break when it finally performed as it should have. After the break we could start writing the software driver. This was another field where we had little to no experience, so this too took quite a while to reach its final form. In the meantime, the experiments with configuration of the FPGA had reached a point where we could start using JBits to create a partial bitstream for dynamic reconfiguration.
Once more, this was a field in which we were absolute newbies, so after a lot of trial and error we managed to get the reconfiguration working, but the newly configured audio codec would only produce a loud whistling noise. The big breakthrough came only one day before the final presentation. On the JBits mailing list, someone remarked that for him, some feature worked with JBits version 2.7 but not with 2.8. This gave us the idea to try the same with the older version 2.7. And what a big surprise: suddenly we could reconfigure successfully. So at the presentation we could at least say "it works!"

During the whole thesis, we both learnt a lot. We got insight into fields as different as VHDL programming, FPGA configuration, CPU design, operating system architecture, the Ethernet protocol, and audio codecs. It was a very interesting time. If we had to choose, both of us would do the same project again.

Finally, we would like to thank a few people who helped us in one way or another:

Our tutor Christian Plessl: He gave us great support with ideas. We could always bug him with our questions, which he answered helpfully. He was also right at hand for questions concerning presentation and documentation.

Our co-tutor Herbert Walder: Mainly his experience with JBits and dynamic reconfiguration was very fruitful for our work.

Prof. Lothar Thiele: We thank him that we could do this thesis in his research group. The provided infrastructure was an important key to our success.

Michael Lerjen: Another student doing his diploma thesis; we could always ask him when we had a problem with VHDL again. Without him, our network card would not have improved so much. We'd especially like to thank him for the improvement in the CRC checker code from the University of Queensland, which he adapted for his own project in such a way that it was also usable for ours.

Other students in our lab: Finally, we don't want to forget all the other students in our lab doing their own theses. We had a lot of interesting discussions, especially during lunch hours, not only about our projects but about nearly everything, mostly related in some way or other to computers.

Zürich, March 15, 2002

Marco Wirz

Matthias Dyer


Contents
Abstract  iii
Preamble  v
Contents  vii
Figures  xi
Tables  xiii

1: Preface  1
  1.1 Motivation  1
  1.2 Problem Task (original in German)  2

2: Dynamic Reconfiguration  9
  2.1 Introduction  9
  2.2 Virtex FPGA Architecture Overview  11
  2.3 Dynamic Reconfiguration for the Virtex Series FPGA  12
  2.4 Design Flows  13
    2.4.1 Flow 1: Without JBits  13
    2.4.2 Flow 2: JBits Only  14
    2.4.3 Flow 3: Combined Flow  14
    2.4.4 Flow 4: Use JBits to Merge Cores  15

3: Development Platform  17
  3.1 Xilinx Virtex XCV800 FPGA  17
  3.2 XSV800  17
  3.3 The LEON Processor  18
    3.3.1 VHDL Configuration  18
    3.3.2 Booting  20
    3.3.3 Top Level Design  21
    3.3.4 UARTs 1 and 2  21
    3.3.5 Synthesis with Synopsys  21
    3.3.6 Implementation  22
  3.4 The Operating System RTEMS  23
    3.4.1 Program Structure  23

4: Network  27
  4.1 Overview  27
  4.2 Ethernet  28
  4.3 Prerequisites  29
    4.3.1 Hardware  29
    4.3.2 Software  29
  4.4 Network Hardware  29
    4.4.1 Architecture  29
    4.4.2 Address Decoder  31
    4.4.3 FIFOs  32
    4.4.4 CRC  32
    4.4.5 Receiver  33
    4.4.6 Sender  34
    4.4.7 Possible Improvements  35
  4.5 Software  37
    4.5.1 Streaming Data to LEON  37
    4.5.2 Driver  38
    4.5.3 Application on LEON  39
    4.5.4 Application on PC  40

5: Implementation of a Dynamic Reconfigurable System  43
  5.1 Partitioning  44
  5.2 Interface  44
  5.3 Virtual Components  45
    5.3.1 VC Interface  46
    5.3.2 PCM Player  46
    5.3.3 ADPCM Player  48
  5.4 Constraining the Design  55
    5.4.1 Floorplanning  55
    5.4.2 Guided Routing  56
    5.4.3 CLB Macros  58
  5.5 Bitstream Manipulation with JBits  60
    5.5.1 Introduction  60
    5.5.2 Function Blocks  61
  5.6 Partial Reconfiguration  63
  5.7 Implementation Results  63
    5.7.1 Dynamic Routing Flow  63
    5.7.2 Direct Copy Flow  64
    5.7.3 Network  64
    5.7.4 Design Facts  65

Conclusions  67
Future Work  69
A: LEON VHDL Files  71
B: UCF Constraint File  73
C: Installing and Compiling RTEMS  77
D: Miscellaneous  79
Bibliography  81

Figures
2-1 Example of Dynamic Reconfiguration  10
2-2 Basic architecture of a Virtex FPGA  11
2-3 Virtex 2-Slice CLB  12
2-4 Dynamic Reconfiguration (Flow 1)  14
2-5 Flow 4: Use of JBits to directly copy a module  15
2-6 Flow 4: Use of JBits and dynamic routing  16
3-1 Block Diagram of XSV800-Board  18
3-2 Block diagram  19
4-1 Architecture of the network card  30
4-2 State machine of the receiver  33
4-3 State machine of the sender  34
5-1 Application of a dynamic reconfigurable System  43
5-2 Partitioning of our reconfigurable system  44
5-3 Detailed view of the interface  45
5-4 Virtual Component Entity Schematic Symbol  46
5-5 Timing for AK4520A Stereo Codec  47
5-6 ADPCM Player Stages  48
5-7 ADPCM Splitter FSM  49
5-8 Control Path State Diagram  52
5-9 ADPCM Decoder Architecture  53
5-10 Floorplanning, Guided Routing and CLB Macros  55
5-11 Guided Routing  57
5-12 Internal view of a passthrough CLB  59
5-13 Double Stage CLB Macro  60
5-14 JBits Design Flow for partial Reconfiguration  61

Tables
3-1 Main configuration of LEON  19
4-1 Ethernet frame  29
4-2 Memory locations of the network card  31
4-3 Possible hardware improvements  37
5-1 VC Input and Output Signals  46
5-2 Sequence to produce serial audio data  48
5-3 ADPCM word format  49
5-4 ADPCM Control Sequence (in pseudo VHDL)  51
5-5 Step Shift Register States  54

1
Preface
1.1 Motivation
For many years, much of the research activity in computer architecture was focused on designing fast general purpose CPUs. Driven by new applications, particularly from the multimedia domain, general purpose CPUs were enhanced by adding additional functional units for better support of the peculiarities of these applications. It turned out that, in spite of all advances in computer architecture, the computing power of general purpose CPUs is not sufficient for certain applications, e.g. real-time video compression. Usually these kinds of applications are enabled by using dedicated hardware based on application specific integrated circuits (ASICs). While being an appropriate solution for a fixed application, an ASIC based solution has the inherent disadvantage that the functionality of the ASIC is not changeable, so the ASIC cannot be used for a different purpose.

Reconfigurable Computing is also based on the idea of accelerating the computing intensive parts of algorithms using application specific circuits. But in contrast to the ASIC approach, the circuits are implemented in reconfigurable logic. Usually, a reconfigurable computing system consists of a general purpose CPU coupled to a reconfigurable device, for instance a field programmable gate array (FPGA). While the reconfigurable unit takes care of the computing intensive kernels of the applications, the CPU is used for the rest of the computations. The ability to change the functionality of the circuit simply by reprogramming the reconfigurable unit adds greatly to the flexibility of such a system.

In the last couple of years a new generation of high-density and high-speed FPGAs emerged, for instance the Xilinx Virtex series. The capacity of these devices is sufficient for implementing a whole reconfigurable computing system consisting of a 32-bit CPU core and a reconfigurable unit. An implementation of the complete system in an FPGA enables an arbitrary coupling of the CPU core and the reconfigurable units.
Xilinx Virtex devices provide an advanced reconfiguration feature called partial reconfiguration. Partial reconfiguration makes it possible to reconfigure only parts of the FPGA, while the parts of the circuit that are not subject to this reconfiguration keep on working without interruption. The availability of free CPU IP cores and high-density partially reconfigurable FPGAs are the fundamentals of this work. The idea of this thesis is to investigate how a reconfigurable computing system based on a 32-bit CPU core can be implemented on a Xilinx Virtex FPGA. To support the replacement of reconfigurable units at runtime, a generic interface for components needs to be defined. To make the system really usable, a method for generating FPGA configurations for the system and the reconfigurable units is needed. As none of the existing mainstream FPGA circuit synthesis tools supports partial reconfiguration, a framework which provides this functionality is required. Ideally, such a tool builds on the ordinary, well-tried design tools and adds the support for partial reconfiguration on top of them.

1.2 Problem Task (original in German)


DA-2002.14: Reconfigurable System on FPGA, Winter Term 2001/02


Christian Plessl, Institut für Technische Informatik und Kommunikationsnetze, ETH Zürich, 23 October 2001
Tutor: Christian Plessl <plessl@tik.ee.ethz.ch>
Students: Matthias Dyer <mdyer@ee.ethz.ch>, Marco Wirz <mwirz@ee.ethz.ch>
Duration: 23 October 2001 to 1 March 2002

1 Background
For many years, research in the field of CPU architecture concentrated mainly on the design of ever faster general purpose CPUs. Great progress has been made, and the computing power of high-performance CPUs has been increasing for years through further architectural advances, for example through the addition of specialized execution units, e.g. data processing units for multimedia applications. It has become apparent that, despite all progress in computer architecture, some applications demand still more computing power. These include in particular applications from the fields of cryptography and multimedia, such as audio, video, and image processing. To provide the computing power these algorithms require, one usually resorts to application-specific integrated circuits (ASICs), which are integrated into the computer system to accelerate a particular task, e.g. plug-in cards for MPEG video compression.

Reconfigurable Computing takes the idea of accelerating time-critical parts of algorithms with problem-adapted hardware one step further. The basic idea is to realize the problem-adapted hardware with programmable hardware devices. The departure from static hardware offers the possibility of reprogramming the hardware during operation, so-called reconfiguration, and makes it possible to use the available hardware resources dynamically for different applications. A number of Reconfigurable Computing research projects now exist that investigate this idea in more detail for different application scenarios. Such a system is usually based on an ordinary CPU coupled with configurable logic, which is used only for the time-critical kernels of the applications. With Zippy, our institute also runs a research project in this area.

One possibility for realizing application-specific hardware in Reconfigurable Computing systems are Field-Programmable Gate Arrays (FPGAs). They are the most highly developed programmable logic devices available today and are used in a wide variety of applications. On today's high-density FPGAs, circuits with a complexity of up to several million gates can be realized. Thanks to the large capacity of FPGAs, even the integration of a complete CPU on an FPGA is possible. It is thus feasible to implement a complete Reconfigurable Computing system on a single device instead of using a dedicated CPU. The advantage of such a system lies in the very close coupling of the CPU and the user-configurable logic. This makes it possible to investigate different kinds of coupling between the CPU and the reconfigurable logic, or to extend the CPU itself with reconfigurable execution units. Realizing the CPU itself on the FPGA means that the CPU is slower than a dedicated ASIC realization would be. An open question is whether the flexibility gained with respect to the integration of CPU and logic can make up for the loss in CPU speed.

2 Problem Statement
In recent years, several projects have implemented a complete CPU including peripherals in an FPGA. While the first of these projects were rather simple 8- or 16-bit CPUs [4], the increased capacity and speed of current FPGAs also allows the implementation of more powerful 32-bit CPUs, e.g. the 32-bit CPU LEON [3], which implements the SPARC V8 instruction set. Today, various CPU components for FPGAs exist, both as free designs and as commercial products. These components are called IP cores. Two kinds of IP cores are distinguished: a) hard IP cores, which are delivered as a netlist for a specific FPGA technology, and b) soft IP cores, which are available as a synthesizable circuit description in a hardware description language such as VHDL or Verilog. In this work, soft IP cores are of particular interest, since they allow modifications to the CPU core itself.

The goal of this project is to implement a complete system consisting of CPU, RAM, and peripherals on an FPGA platform and, in a further step, to extend it to a Reconfigurable Computing system. The LEON soft IP core shall be used for the CPU. This is a 32-bit CPU originally developed by the European Space Agency (ESA), whose source code is freely available under the LGPL open-source license. LEON implements the SPARC V8 instruction set and can easily be adapted to different implementation platforms (FPGAs, ASICs). The XSV800 FPGA board from XESS is used as the target platform [8]. It offers a Xilinx Virtex XCV800 FPGA [9], 2 MB of SRAM, and various peripheral interfaces.

The diploma thesis covers two areas. In a first step, the LEON CPU shall be implemented and tested on the board. The focus of this part of the work is to lay the foundations for implementing a computer system based on LEON: synthesis of the CPU, adaptations to the prototyping platform, software development, and debugging. To turn this system into a Reconfigurable Computing system, a user-definable hardware unit shall be integrated in a second step. Initially, this can be a static unit that is generated during the synthesis of the processor. In a further step, it shall be investigated how this unit can be designed so that it can be replaced dynamically at runtime.

A concept for a generic interface between the CPU and user-specific hardware units shall be developed. The functionality of the interface shall be demonstrated with custom hardware units. In a final step, it shall be investigated whether and how hardware units can be exchanged dynamically at runtime.

3 Subtasks
1. FPGA Fundamentals: Familiarize yourself with the fundamentals of FPGAs, in particular the Xilinx Virtex series [9] used on our board. Get to know the XESS XSV800 board [8] and the FPGA design tools by implementing a few test circuits.

2. LEON / SPARC Fundamentals: Get an overview of LEON [3] and the SPARC V8 architecture [7].

3. CPU Core Implementation: Develop a concept for how the CPU shall be implemented on the board. Implement and test your CPU.

4. Development Environment for the CPU Core: To enable comfortable software development for the board, you need a boot loader, assembler, compiler, and linker. Use the GNU cross-compiler and the GNU binutils for SPARC; a specially adapted LEON cross-compiler kit (LECCS) exists for this purpose [2]. Investigate whether a simple embedded operating system such as eCos [6] or RTEMS [1] is useful for your purposes and whether porting it to this platform is possible. The XESS board offers a variety of peripheral interfaces; implement those that are important for debugging and testing [5].

5. Integration of User-Configurable Logic into the CPU Core: Investigate different ways in which user-configurable logic can be accessed from the CPU (memory-mapped, AMBA bus, reconfigurable functional units) and implement a suitable variant.

6. Concept for Dynamic Reconfiguration of the User-Configurable Logic: Familiarize yourself with the mechanisms of partial reconfiguration of the Xilinx Virtex [10]. Investigate how a generic component that can be exchanged for another at runtime can be integrated into the CPU core design.

7. Design Flow for the Reconfigurable Computing System: Investigate how your concepts for dynamically reconfiguring the user-specific units can be realized. Examine the possibilities of the JBits tool from Xilinx. Define and implement a demonstrator that shows the principles of dynamic reconfiguration.

4 Organization
Schedule: At the beginning of your work, draw up a realistic schedule together with your tutor. Keep a continuous record of your progress.


Documentation: Document your work carefully. Pay particular attention to describing your considerations and design decisions.

References
[1] OAR Corp. RTEMS homepage. http://www.rtems.com.
[2] Jiri Gaisler. LECCS: LEON/ERC32 cross compilation system. http://www.gaisler.com/leccs.html.
[3] Jiri Gaisler. The LEON processor user's manual. Gaisler Research, version 2.3.7 edition, August 2001.
[4] Jan Gray. Building a RISC system in an FPGA. Circuit Cellar, 116:26-33, March 2000.
[5] James O. Hamblen and Michael D. Furman. Rapid Prototyping of Digital Systems. Kluwer Academic Publishers, 2000.
[6] RedHat. eCos homepage. http://sources.redhat.com/ecos.
[7] SPARC International Inc., 535 Middlefield Road, Suite 210, Menlo Park, CA 94025, USA. The SPARC Architecture Manual, Version 8, sav080si9308 edition, 1992.
[8] XESS Corporation, 2608 Sweetgum Drive, Apex, NC 27502, USA. XSV Board V1.1 Manual, version 1.1 edition, September 2001.
[9] Xilinx. Xilinx Virtex 2.5V Field Programmable Gate Arrays, v2.5 edition, April 2001.
[10] Xilinx Inc. Xilinx Application Note XAPP138: Virtex FPGA Series Configuration and Readback, v2.4 edition, July 2001. http://www.xilinx.com/xapp/xapp138.pdf.


2
Dynamic Reconfiguration
2.1 Introduction
Dynamic reconfiguration is the ability to update only a portion of the configuration memory in an FPGA with a new configuration without stopping the functionality of the unchanged section of the FPGA [13]. Dynamic reconfiguration enlarges the design space for developers: different logic functions can be stored in memory until the need arises for them to be configured into the FPGA. Recent advances in the manufacturing process promise 50 million gates of reconfigurable logic by 2005 at substantially lower cost. The increased gate count, along with richer embedded feature sets, has greatly improved the economics of using reconfigurable technology. A single FPGA can simultaneously carry various complex cores such as processors, decoders, and filters, to name just a few. Dynamic reconfiguration makes it possible to replace a specific core when a new function is required. This is similar to computers with large hard drives, which store applications for days before they are loaded into memory.

Imagine a system which uses five different cores over time, but no more than three simultaneously. Without dynamic reconfiguration you would either need a huge FPGA which can carry all five cores at once, or three individual FPGAs which are fully reconfigured. The first case wastes FPGA area; the second implies increased hardware cost and power consumption. With dynamic reconfiguration, an FPGA large enough to carry three cores (for all occurring combinations) will suffice. Figure 2-1 illustrates this advantage: all cores are stored in memory, and on demand an unused core can be replaced with a new core through a partial bitstream. The difference to a full reconfiguration is that the other cores are not affected by the reconfiguration and keep their state. The following list points out some advantages of dynamically reconfigurable systems:

Rapid prototyping: With dynamic reconfiguration, a modular design can be implemented. A team of engineers can independently work on different pieces

(Figure 2-1: Example of Dynamic Reconfiguration)

of a design and later merge these modules into one FPGA design. This parallel development saves time and allows for independent timing closure on each module. Dynamic reconfiguration also allows you to modify a module while leaving other, more stable modules intact.

Reconfiguration speed: The time needed for a dynamic reconfiguration is proportional to the size of the configuration bitstream, which depends on the area to be changed. If only 20% of an FPGA is reconfigured, the programming time will also be 20% of that of a full reconfiguration.

FPGA size: If the design uses some of the modules only temporarily, the FPGA area can be shared through dynamic reconfiguration, so a smaller FPGA can be used.

Glue logic: Having the modules together on a single FPGA instead of in separate components allows flexible, complex, and high-speed connections between the cores.
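The proportionality between reconfigured area and programming time can be sketched with a small model. The configuration-port width, clock rate, and bitstream size below are illustrative assumptions, not measured values from this thesis:

```python
# Back-of-the-envelope model: partial-reconfiguration time scales linearly
# with the amount of configuration data written. The 8-bit-per-cycle port
# width, 50 MHz configuration clock, and bitstream size are assumptions.
def reconfig_time(bitstream_bits, fraction_changed, clock_hz, bits_per_cycle=8):
    """Seconds needed to load the changed fraction of the bitstream."""
    return (bitstream_bits * fraction_changed) / (bits_per_cycle * clock_hz)

FULL_BITS = 4_715_616  # assumed full-device bitstream size in bits

t_full = reconfig_time(FULL_BITS, 1.0, 50_000_000)
t_part = reconfig_time(FULL_BITS, 0.2, 50_000_000)
# Reconfiguring 20% of the device takes 20% of the full programming time.
assert abs(t_part / t_full - 0.2) < 1e-9
```

Under this linear model, shrinking the region to be reconfigured pays off directly in reconfiguration latency, which is why the partitioning of the dynamic region matters.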

2.2

Virtex FPGA Architecture Overview


(Figure 2-2: Basic architecture of a Virtex FPGA, showing CLBs, Block RAMs, and IOBs)

Figure 2-2 shows an architecture overview of a Virtex FPGA. Virtex FPGAs [1] are composed of an array of Configurable Logic Blocks (CLBs) surrounded by a ring of Input/Output Blocks (IOBs). On the east and west edges are Block RAMs (BRAMs). The CLBs are the primary building blocks and contain elements for implementing customizable gates, flip-flops, and wiring for connectivity. The IOBs provide circuitry for communicating signals with external devices. The BRAMs allow for synchronous or asynchronous storage of kilobits of data, though each CLB can also implement synchronous/asynchronous 32-bit RAMs. Each CLB contains two slices (see figure 2-3). Each slice implements two 4-input Look-Up Tables (LUTs), two D-type flip-flops, and some carry logic. The general routing allows data to be passed to or received from other CLBs.
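The key property of a 4-input LUT is that its 16 configuration bits are nothing but a truth table, so any 4-input Boolean function can be "configured" into it. A minimal software model of this idea (not from the thesis, purely illustrative):

```python
def make_lut4(func):
    """Build the 16-entry truth table a 4-input LUT would be configured
    with: one output bit per input combination."""
    return [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) & 1
            for i in range(16)]

def lut4(table, a, b, c, d):
    """Evaluate the LUT: the four inputs simply form an index into the table."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]

# Example: configure the LUT as a 4-input XOR (parity function).
xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
assert lut4(xor4, 1, 0, 1, 1) == 1
assert lut4(xor4, 1, 1, 0, 0) == 0
```

Changing the function of a LUT therefore amounts to rewriting 16 bits of configuration memory, which is exactly what bitstream-level tools such as JBits exploit.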


(Figure 2-3: Virtex 2-Slice CLB, from the Xilinx Virtex data sheet)

2.3

Dynamic Reconfiguration for the Virtex Series FPGA

The Virtex series FPGA supports dynamic reconfiguration. The configuration logic is separated from the user logic and does not require the use of normal resources, allowing for continued operation of sections that do not change. The configuration write sequence is a glitchless operation, so that only the memory bits that were modified are toggled. The one exception to this is the Block RAM: the configuration logic requires the use of the read/write ports of the Block SelectRAMs when the memory contents must be read or written [13].

The smallest amount of configuration memory that can be written to or read from is a frame. A frame spans from the bottom of the device to the top of the device, including the IOBs and CLBs, and contains a section of the data needed for each row. While an entire frame must be written into the device, only the bits that have changed will be toggled. This can allow a single bit to be changed without affecting the rest of the device operation.

The routing configuration memory is modified in the same manner as the logic configuration. Modification of routing connectivity may cause contention. This will not damage the device as long as it is short (<30 ms). A signal which passes through a section of change will continue to pass the data during the reconfiguration, providing that the reconfiguration does not intentionally change connections to the signal wire.

Although some FPGAs like the Virtex series support dynamic reconfiguration, no design environment exists so far to develop complex dynamically reconfigurable systems in an integrated flow. The manufacturers are aware of this deficiency and are working on enhancements to be included in future versions of the implementation software. Until then, developers have to climb down to a low level of bitstream manipulation. With Xilinx's JBits SDK [14], a Java program to dynamically create or manipulate Virtex bitstreams, a first powerful tool is available. But it is still far from the comfort of high-level language hardware compilers.

2.4

Design Flows

We can distinguish four different design flows for dynamic reconfiguration with today's tools:

Flow 1: Design all modules with common implementation tools and extract the partial reconfiguration directly from the bitstreams.

Flow 2: Design all modules with JBits, with the restriction of working on a low abstraction level.

Flow 3: Design an initial version with common implementation tools and apply dynamic changes with JBits.

Flow 4: Design all modules with common implementation tools and merge them with JBits.

Which of the flows is to be preferred depends on the targeted application. None of them seems optimal. Due to the lack of an elaborated design environment, developers have to accept restrictions.

2.4.1

Flow 1: Without JBits

Without JBits, partial reconfiguration usually consists of a manual (or script-based) manipulation of bitstreams. All non-partial bitstreams are generated with common synthesis and implementation tools. From these bitstreams the modules are extracted by copying only the frames containing the modules into a new partial bitstream. The corresponding frames on the FPGA device are then replaced with the ones from the partial bitstream by dynamic reconfiguration (see figure 2-4). A complete study of this flow can be found in [4].
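The frame-copy step of this flow can be sketched in plain C. This is a conceptual sketch only: FRAME_BYTES and the frame indices are hypothetical placeholders, not the real XCV800 frame geometry, and a real tool must additionally rewrite the bitstream header and CRC.

```c
#include <string.h>

/* Hypothetical geometry; the real Virtex frame length depends on the
 * device and is NOT 64 bytes. */
#define FRAME_BYTES 64

/* Copy frames [first, first+count) of a full bitstream into the payload
 * buffer of a partial bitstream. */
static void extract_frames(const unsigned char *full, unsigned char *partial,
                           int first, int count)
{
    memcpy(partial, full + (long)first * FRAME_BYTES,
           (long)count * FRAME_BYTES);
}
```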

Restrictions
A restriction that comes along with this flow is that the granularity of changes is one frame. This implies a vertical segmentation of the design, which is not always applicable. Because no dynamic routing is done, the connections to other components have to be exactly the same for every module. This requires designing a hard-constrained interface to ensure that every module uses the same routing resources for the connections.




Figure 2-4: Flow 1: Dynamic reconfiguration without JBits. The frames to extract are taken from bitstream 1 into a partial bitstream, which is then pasted over an existing configuration (bitstream 2, on the FPGA) by dynamic reconfiguration.

2.4.2

Flow 2: JBits Only

If you design all modules in JBits, you can easily use the dynamic reconfiguration ability of this program. At any time the program can write changes in the design as a partial bitstream to the device. The JBits SDK also includes an automatic router, which can dynamically route and unroute connections.

Restrictions
This flow seems applicable only for small or data-flow-oriented applications. There are no tools in JBits to implement state machines or other helpful modeling constructs. JBits is still in development and incomplete. For example, the abilities of the autorouter are still limited: not all resources can be used (e.g. long lines).

2.4.3

Flow 3: Combined Flow

An initial design is created with common synthesis and implementation tools and configured onto the FPGA device. Then JBits applies all further changes dynamically. The changes are not taken from another bitstream but created by the JBits program itself. This flow is suitable for complex applications which need only minor dynamic changes, like changing parameters of an algorithm or changing connectivity.

Restrictions
Now the restrictions of Flow 2 only apply to the dynamic changes and not to the whole design anymore. The difficulty with a combined flow is to design the initial circuit in a way that the JBits program can find specific resources again. Therefore the initial design needs exact floorplanning. Another restriction concerns signal timing. The implementation tool usually uses a delay-based routing algorithm to create low-skew circuits. If these connections are removed and rebuilt with JBits, the timing is not guaranteed anymore, unless the program explicitly verifies the delays of the new routes.

2.4.4

Flow 4: Use JBits to merge Cores

This flow is comparable to Flow 1. The difference is that we use JBits to extract the modules out of the bitstreams and to merge them together with dynamic reconfiguration. There are two main advantages over Flow 1. First, to extract a module, JBits is not limited to the granularity of frames: an individual CLB can be read with all linked resources and be written as a partial bitstream to a device. Secondly, a module can be connected to the interface with JBits' autorouter. We can realize dynamic reconfiguration in two ways within this flow. Figure 2-5 shows a flow which directly copies the module from one bitstream to another. A flow which uses the routing function of JBits is illustrated in figure 2-6.
Figure 2-5: Flow 4: Use of JBits to directly copy a module. A first bitstream (bitstream 1) is loaded into memory. It contains a module which is connected via a hard route to the interface. The reconfigurable area is then copied and pasted over the module of the actual configuration (bitstream 2, on the FPGA) by dynamic reconfiguration. Since all modules use the same (hard) connections, the correct connection of the new core is guaranteed.

Restrictions
The restriction on signal timing (cf. Flow 3) also concerns the flow which uses JBits' routing function. This is not important for the flow which directly copies the module, since it does not use the autorouter. But this flow uses a hard interface like Flow 1, and hard interfaces often need manual editing.



Figure 2-6: Flow 4: Use of JBits and dynamic routing. (a) Load a first bitstream into memory containing the module to configure; the module is connected to a fixed interface. (b) Unroute the connections to the interface and remember the source and sink locations. (c) Load a second bitstream into memory with the actual configuration of the FPGA. (d) Unroute the connections to the interface in the second bitstream (the reconfigurable area). (e) Copy the module from the first bitstream from step (b) and insert it into the second bitstream from step (d). (f) Reconnect the new module with the information from step (b). The result of this flow is a partial bitstream containing all changes from step (d) to step (f), which will be written to the device.


3
Development Platform
In this chapter we describe the development platform which we set up and used for our thesis. We first describe the components of the development board and then go on to the platform itself: the hardware used and finally the software.

3.1

Xilinx Virtex XCV800 FPGA

The Xilinx Virtex XCV800 FPGA is a high-density FPGA with 800K equivalent gates. The version used on the board is in a 240-pin HQFP package. The XCV800 contains 84 columns and 56 rows of CLBs. Additionally the FPGA contains a total of 28 block RAMs with a capacity of 4096 bits each. These block RAMs are fully synchronous dual-ported RAMs with independent control signals for each port. An overview of the architecture of the FPGA is given in section 2.2.

3.2

XSV800

The XSV800 prototyping board from XESS is a versatile platform for developing FPGA circuits. Its Xilinx Virtex XCV800 FPGA is connected to different interfaces for communication with the outside world. There are two serial ports, one parallel port, an Xchecker interface, a USB port, PS/2 mouse and keyboard ports and a 10/100 Mbit Ethernet physical layer interface (Ethernet PHY). Further, there is a 110 MHz RAMDAC for video signal generation and an audio driver which can process audio signals with a resolution of up to 20 bits and a bandwidth of 50 kHz. For local data storage the board provides two RAM banks with 512K x 16 bits capacity each. A separate 16 Mbit Flash EEPROM can be used either to save the configuration of the FPGA or to store data for use by the FPGA after configuration is complete. Finally, there are some local controls on the board like 4 push buttons, a row of 8 DIP switches, two 7-segment LED displays and 10 universal LEDs.

A block diagram of the XSV board is shown in figure 3-1. The dotted elements were not used for our project.

Figure 3-1: Block diagram of the XSV800 board (Virtex FPGA XCV800HQ240 with CPLD XC95108, four 512Kx8 SRAM banks, 16 Mbit Flash, MAX232A serial drivers, PDIUSBP11A USB transceiver, Ethernet PHY with RJ45, video decoder, RAM DAC with VGA output, 20-bit stereo codec, push buttons, DIP switches and LEDs)

3.3

The LEON Processor

The LEON is a freely available VHDL model of a 32-bit processor conforming to the SPARC V8 architecture. Originally developed by the European Space Agency, it is now available under the Lesser GNU Public License (LGPL)1. It is being maintained and further developed by Gaisler Research [3] in Göteborg, Sweden. A simple block diagram of the LEON processor is shown in figure 3-2.

3.3.1 VHDL Configuration


The VHDL model of LEON is fully configurable to permit synthesis for different cache sizes, multiplier units and target architectures. The main configuration is done in the file target.vhd. The basic configuration record consists of entries as shown in table 3-1. The descriptions cover some of the most interesting or important issues of each configuration option. A complete description of the options can be found either in the VHDL source or in The LEON Processor User's Manual [3].
1

http://www.gnu.org/licenses/lgpl.html



Figure 3-2: Block diagram of the LEON processor. The SPARC V8 integer unit with instruction and data caches sits on the AMBA AHB bus together with the AHB controller and the memory controller (PROM, SRAM and I/O over an 8/16/32-bit memory bus). An AHB/APB bridge connects the AMBA APB peripherals: timers, interrupt controller, UARTs and I/O port. The FPU, PCI and co-processor interfaces were not used in our project.

Option     Configuration description
synthesis  Target technology is Xilinx Virtex, block PROMs will be used.
iu         Multiplier is optimized for use on an FPGA; no MAC, no FPU, no co-processor.
fpu        Type of FPU (none).
cp         Type of co-processor (none).
cache      2k instruction cache, 2k data cache.
ahb        One master on the AHB bus.
apb        One interrupt controller, no PCI.
mctrl      32-bit memory.
boot       Boot from PROM, clock 20 MHz, baud rate 38400.
debug      Enable disassembling, but no other debugging.
pci        No PCI interface.
peri       Enable configuration register, no watchdog timer, no second interrupt controller.

Table 3-1: Main configuration of LEON


The processor can actually be run at 25 MHz on the XSV board, but then placing & routing takes quite a long time. Since 20 MHz is more than fast enough for our project, we reduced the frequency, which resulted in a significant speedup of the implementation process.

3.3.2

Booting

When LEON starts, it boots from an internal boot ROM according to our configuration. For the boot PROM it is possible to use either block RAMs or distributed logic. When using block RAMs, the boot PROM is built with the Logic Core Generator from a template (virtex_prom256.xco). The contents of the ROM are taken from the file virtex_prom256.mif, which contains just the bare bit code. When using standard logic cells, the boot program can be coded in a VHDL file as a large memory lookup table. This is done in the file bprom.vhd. We first defined two block RAMs with a total of 256 x 32-bit words as a block PROM for the boot process. Later it proved better to use standard logic cells, so that the number of used block RAMs could be reduced to 14 and LEON only uses block RAMs on one side of the FPGA. Together with the area constraints this led to fewer disturbing lines (cf. section 5.4.2) through the reconfigurable area.

Pmon
As a simple boot program we used the pmon monitor that comes with the LEON distribution. Pmon is a small boot program which, after initializing the processor, performs some memory checks, activates the serial port, says hello to the world ("LEON-1: 1*2048K 32-bit memory, rmw") and then waits for a program to be downloaded over the serial interface. This program is in the S-Record format, which can be generated from an executable with the GNU objcopy program. We always downloaded programs at 38400 bps, although it would be preferable to increase this baud rate, since a compiled RTEMS program takes more than 3 minutes to load over the serial interface. We have not tested whether it works at higher rates, though.
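An S-record line such as those pmon consumes can be validated in a few lines of C. This is an illustrative sketch (not pmon's code); it checks the checksum byte, which is the one's complement of the low byte of the sum of the count, address and data bytes:

```c
#include <stdlib.h>
#include <string.h>

/* Read one hex byte (two ASCII characters). */
static int hexbyte(const char *s)
{
    char buf[3] = { s[0], s[1], '\0' };
    return (int)strtol(buf, NULL, 16);
}

/* Return 1 if the S-record line has a valid checksum, 0 otherwise.
 * Layout: 'S', type digit, count (1 byte), then count bytes of
 * address + data + checksum, all as hex pairs. */
static int srec_valid(const char *line)
{
    if (line[0] != 'S')
        return 0;
    int count = hexbyte(line + 2);
    if ((int)strlen(line) < 4 + 2 * count)
        return 0;
    unsigned sum = (unsigned)count;
    for (int i = 0; i < count - 1; i++)          /* address + data bytes */
        sum += (unsigned)hexbyte(line + 4 + 2 * i);
    return ((~sum) & 0xffu) == (unsigned)hexbyte(line + 4 + 2 * (count - 1));
}
```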

Rdbmon
Another boot program is called rdbmon. It allows plugging into the processor with the debugger gdb. Rdbmon provides support for setting breakpoints, single stepping and all the typical debugging tasks like reading memory addresses, disassembling code etc. The plug-in runs over the second serial interface, while stdin and stdout of LEON remain on the first serial interface. So with the appropriate cable (cf. section 3.3.4) we started minicom, a simple terminal program, on the first serial interface to capture all the output of LEON. After downloading the rdbmon, gdb is started with the according executable. The connection to the board is established with

(gdb) set remote-baud 38400
(gdb) target extended-remote /dev/ttyS1

and the program downloaded and started:

(gdb) load
(gdb) run

One of the advantages of rdbmon over the simpler pmon is that when plugging in with the debugger, the boot program can be recycled. So when the program that was run on LEON has terminated without error, the loader can be resumed with a simple

(gdb) jump *0x401f0000

in gdb, whereas with pmon the monitor has to be downloaded again. After restarting, a new program can be loaded with

(gdb) file newprog.exe

The board has to be contacted again (target ...) and the new program can be downloaded.

3.3.3

Top Level Design

For our top level design we took the file xsv800.vhd, which was posted to the LEON mailing list by Stephan Schirrmann [15]. The main task of the top level design file is to connect LEON correctly to the two RAM banks on the XSV board. We later expanded it to also connect LEON to our extensions like the audio codec and the network interface.

3.3.4

UARTs 1 and 2

To be able to use both of the serial ports of LEON we reprogrammed the CPLD on the XSV board. The CPLD is connected to 4 signals (RxD, TxD, RTS, CTS) of the DB9 connector on the board. Since we don't use flow control, we could route the signals of the second serial port over the RTS/CTS lines of this connector. We had to build a special Y-cable which splits these signals to two connectors on the other end.

3.3.5

Synthesis with Synopsys

To synthesize LEON we used the fc2_shell of Synopsys FPGA Compiler 2. It is scriptable, but the commands can also be entered manually. The steps to create a chip from the source VHDL files are:


(1) create_project leon
(2) add_file -library WORK -format VHDL path/to/file.vhd
(3) analyze_file
(4) create_chip -name leon -target VIRTEX -device V800HQ240 -speed -4 -frequency 20 -preserve xsv800
(5) current_chip leon
(6) optimize_chip -name leon_x
(7) export_chip -root leon_x

With (1), the project is built. Then with (2) the source files are added; this command has to be executed once for each source file. A correct order to add the files can be seen in appendix A. The command (3) analyzes all the added VHDL files and checks for syntax errors. The next step (4) is to create a chip and specify the hardware parameters. Synopsys does not automatically switch to the new chip as current chip, so this has to be done manually (5). Optimizing the chip (6) is the step which takes most of the time. When this is finished, the chip can be written to a file with (7) for further processing.

So much for the first creation of the chip. When the VHDL source has been updated (i.e. any number of files have been changed) it is not necessary to perform all these steps again:

(1) current_chip leon
(2) analyze_file
(3) update_chip
(4) current_chip leon
(5) optimize_chip -name leon_x+1
(6) export_chip -root leon_(x+1)

After switching the current chip back to the unoptimized version (1), all the modified files are re-analyzed (2). Then the chip must be updated (3), optimized as a new version (5) (again, the updated chip is not automatically set as current, hence (4)) and exported again (6). Old designs can be deleted with delete_chip name, where name is not the current chip.

3.3.6

Implementation

The implementation was done with Xilinx Foundation, first with version 3.3, later also with version 4i. It is important to specify the correct constraints file so that the pins of the FPGA are assigned correctly. One pitfall here is that Foundation complains if there are too many pins constrained in the file, but if some pins are unassigned, it just places them somewhere. This way, we once nearly ruined our board when reprogramming the CPLD. The CPLD is connected such that some freely usable pins are connected with the programming pins. So when one of these pins is tied to a constant logic level, it is not possible to reprogram the CPLD anymore. We were somewhat lucky, because between the pin which was set to a constant logic level and the programming pin there was a jumper which could be removed to reprogram the CPLD again.

3.4

The Operating System RTEMS

When we were looking for an operating system for the LEON processor, we found that RTEMS from the OAR Corporation [8] had already been ported to this architecture. RTEMS is an open source real-time operating system, and since it has a small kernel it was exactly what we were looking for. So after only a short time our first "Hello World" program was successfully tested on the XSV board. Programs are compiled with the LECCS tool set, which is the port of the GNU GCC compiler for the LEON architecture. To load a program onto the LEON, the executable has to be converted to an S-Record file first:

sparc-rtems-objcopy -O srec program.exe program.srec

We denoted all LEON executables with the extension .exe so that they wouldn't be confused with normal Linux ELF executables. After power-up, LEON executes the pmon application, which waits for the download of an S-Record file on the serial interface. So this file is sent via the serial interface to the LEON:

cat program.srec > /dev/ttySx

The serial interface has to be configured to the correct baud rate and transmission parameters (8n1). This is done by starting minicom and configuring the interface. minicom can then be left running, so it will show the standard output of the program running on LEON. The program doesn't have to be started explicitly; this is already done by the S-Record loader.

3.4.1

Program Structure

RTEMS is not an operating system like, for example, Linux, which runs on the processor and then loads and executes new programs. Instead, the user program is written and the operating system is linked as a library to the program, so that one large program results which includes both the OS and the user program.

In an RTEMS program some basic constants have to be defined for certain features to be enabled during compilation. In our case this was:
/* we need a console for communication, mostly debugging */
#define CONFIGURE_APPLICATION_NEEDS_CONSOLE_DRIVER

/* we use the clock as a still alive indicator */
#define CONFIGURE_APPLICATION_NEEDS_CLOCK_DRIVER
#define CONFIGURE_TICKS_PER_TIMESLICE 50

/* we have a timer for network timeouts */
#define CONFIGURE_APPLICATION_NEEDS_TIMER_DRIVER
#define CONFIGURE_MAXIMUM_TIMERS 5

/* to start several separate tasks */
#define CONFIGURE_EXTRA_TASK_STACKS (4 * RTEMS_MINIMUM_STACK_SIZE)
#define CONFIGURE_MAXIMUM_TASKS 8
#define CONFIGURE_RTEMS_INIT_TASKS_TABLE

/* this is needed for the TCP/IP stack */
#define CONFIGURE_USE_MINIIMFS_AS_BASE_FILESYSTEM
#define CONFIGURE_LIBIO_MAXIMUM_FILE_DESCRIPTORS 10

An RTEMS program has no main() function like a normal C program. Instead, after starting, the operating system sets up the whole environment and then starts the function Init(rtems_task_argument) as a new thread. This is where the user can start setting up the environment for his own program. Usually, some new threads are started here, and possibly some interrupt vectors and timers get installed. We also set up the network stack here: we have to give it an Ethernet address, an IP address with netmask, a standard gateway etc. After everything has been initialized, the init task is terminated. The system runs on with the different started tasks.
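The network setup is done through the configuration tables of the RTEMS BSD stack. The following is a sketch under the RTEMS 4.x API; the addresses and the attach function name leon_eth_driver_attach are placeholders for our driver, not fixed names:

```c
#include <rtems/rtems_bsdnet.h>

/* Attach function provided by our network driver (placeholder name). */
extern int leon_eth_driver_attach(struct rtems_bsdnet_ifconfig *conf,
                                  int attaching);

/* Interface description: name, attach function, next interface,
 * IP address and netmask (example values). */
static struct rtems_bsdnet_ifconfig eth_config = {
    "eth0",
    leon_eth_driver_attach,
    NULL,
    "192.168.1.2",
    "255.255.255.0",
};

/* Global stack configuration picked up by rtems_bsdnet_initialize_network(),
 * which is called from the Init task before it deletes itself. */
struct rtems_bsdnet_config rtems_bsdnet_config = {
    &eth_config,
    NULL,               /* no BOOTP */
};
```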

Tasks
When a new task is started, it has to tell the operating system whether it may be preempted. That means it is possible to create preemptible as well as non-preemptible tasks. This is done with the rtems_task_mode function:
rtems_mode old_mode;
rtems_task_mode(RTEMS_PREEMPT | RTEMS_TIMESLICE,
                RTEMS_PREEMPT_MASK | RTEMS_TIMESLICE_MASK,
                &old_mode);

A task in RTEMS is basically a function which is called when the task is invoked. It then either runs until the function returns or the task terminates itself by calling rtems_task_delete(RTEMS_SELF). It can also actively call the scheduler with the function rtems_task_wake_after(timeout). If the argument timeout is non-zero, the task is sent sleeping for the specified time. Otherwise, just a rescheduling is done and the task is immediately put into the ready queue again.

Interrupts and Timers


Interrupt and timer functions are implemented the same way as tasks: they are functions which are called when the interrupt is triggered or the timer expires. The definition of a timer function is as follows:

rtems_timer_service_routine timer_routine(rtems_id timer_id, void *user_data)

A timer is created with

rtems_timer_create(t_name, &timer_id)

and started with

rtems_timer_fire_after(timer_id, TICKS_PER_SECOND / 25, timer_routine, &timer_data);

The pointer *user_data, which is an argument of the timer routine, can be used to pass some data to the timer function. The last argument timer_data of the calling function rtems_timer_fire_after() is passed for this purpose. The definition of an interrupt handler is this:

static rtems_isr eth_interrupt_handler(rtems_vector_number v)

Interrupt handlers should only perform a minimal set of actions so that they return soon. They also must not call blocking functions. One function which is allowed to be called is rtems_event_send. So an interrupt handler can check the cause for its invocation and then send a message to a specific task to handle the cause. In the case of the network card, the interrupt handler could check whether there is a new frame in the input buffer or whether the frame in the outgoing buffer has been successfully sent. Depending on the outcome of this check it could then invoke either the sending or the receiving task. This feature has not yet been implemented; instead, the interrupt handler always calls the receiving task.




4
Network
4.1 Overview
As one of the main goals of our thesis was to show an application of a reconfigurable system on an FPGA, we decided to implement two different audio codecs which could be exchanged at runtime while the rest of the system on the FPGA continued to run uninterrupted. To be able to practically demonstrate this scheme, we had to play sound on each audio codec. But to play sound, a large quantity of data is necessary, so we had to somehow transport this data to the LEON. One possibility would be to use the serial interface, which is the standard input and standard output of the LEON processor. The data rate on the serial interface is 38400 bps, which is about 3800 bytes/s. But sound at an acceptable quality has a sampling rate of at least 11.025 kHz with a sample size of 8 bits mono. This results in

11025 Hz x 8 bits = 88200 bps

So we decided to implement a network interface for LEON. The processor itself only occupies about half of the FPGA, so there is still enough room for a small network interface. On a 10 Mbps Ethernet interface, the usable frame rate is about

10 Mbit/s / (1520 bytes/frame x 8 bits/byte) = approx. 820 frames/s

The maximal payload of an Ethernet frame is 1500 bytes. But with overhead from IP of 20 bytes and UDP of 8 bytes, the usable data size is 1472 bytes/frame. So we can transmit data with no more than



1472 bytes/frame x 820 frames/s = approx. 1.15 Mbytes/s

Audio data with a sampling rate of 44.1 kHz, 16 bit stereo, has a data rate of

44100 Hz x 16 bits x 2 channels = approx. 172 kbytes/s

So the data rate on the network interface should be more than enough to stream music with CD quality and even implement a simple flow control.
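The estimates above can be recomputed mechanically; the sketch below follows the same assumptions (1520 bytes per frame on the wire, 1472 bytes of UDP payload per frame):

```c
/* Bandwidth estimates for a 10 Mbps Ethernet link. */
#define ETH_BITRATE     10000000L   /* 10 Mbps                   */
#define ETH_FRAME_BYTES 1520L       /* max. frame incl. overhead */
#define UDP_PAYLOAD     1472L       /* 1500 - 20 (IP) - 8 (UDP)  */

static long frames_per_second(void)
{
    return ETH_BITRATE / (ETH_FRAME_BYTES * 8);    /* approx. 820 */
}

static long payload_bytes_per_second(void)
{
    return frames_per_second() * UDP_PAYLOAD;      /* approx. 1.15 MB/s */
}

static long audio_bytes_per_second(void)
{
    return 44100L * (16 / 8) * 2;                  /* 16 bit stereo */
}
```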

4.2

Ethernet

The structure of an Ethernet frame is shown in table 4-1. Each frame starts with a preamble of 7 bytes with the bit pattern 10101010, which allows the receiver to synchronize its clock to the sender's clock. The next byte is the start of frame delimiter (SFD). It signifies the end of the preamble and hence the start of the actual frame. Its value is 10101011. The header of the frame starts with two addresses. Ethernet addresses are 48 bits long. The first one is the destination address, followed by the source address. The type field denotes what the frame contains. Alternatively, it can carry length information: if the value is less than 1500, it denotes the length of the following data; otherwise, it indicates what the data contains. A value of 0x0800 denotes an IP packet, whereas 0x0806 stands for an ARP packet. In case the payload of the Ethernet frame is an IP or ARP packet, no length information is needed, since both the IP and the ARP header contain the information about the length of the packet. A more detailed description of the Ethernet protocol can be found in [12] (chapter 4.3.1, IEEE Standard 802.3 and Ethernet).

It is actually possible to transport more data in an Ethernet frame than the size of an IP packet. The additional bytes are then just ignored at the receiver side. Our implementation actually uses this feature, because in the current implementation every packet that is passed to the network card is automatically stuffed to a length which is a multiple of 4 bytes. As said above, an Ethernet frame can contain up to 1500 bytes of data. But the whole frame (without preamble) must have a length of at least 64 bytes. This means that the data portion must be between 46 and 1500 bytes. If the packet in the frame is less than the required size, the frame can be filled with the optional pad field. Finally, the CRC is calculated over the whole frame from the destination address to the padding.


Field:  Preamble  SFD  Dst Addr  Src Addr  Type  Data    Pad   CRC
Bytes:  7         1    6         6         2     0-1500  0-46  4

(header: 14 bytes; body: 46-1500 bytes)

Table 4-1: Ethernet frame
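The 14 header bytes of table 4-1 map directly onto a packed C struct. The sketch below is illustrative (it is not the driver's code) and assumes a GCC-style packed attribute; it also shows the two type values mentioned above:

```c
#include <stdint.h>

/* On-wire Ethernet header (preamble and SFD are handled by the PHY).
 * Packed so that sizeof matches the 14 header bytes of table 4-1. */
struct __attribute__((packed)) eth_header {
    uint8_t  dst[6];     /* destination address    */
    uint8_t  src[6];     /* source address         */
    uint16_t type;       /* big-endian on the wire */
};

#define ETH_TYPE_IP  0x0800
#define ETH_TYPE_ARP 0x0806
```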

4.3
4.3.1

Prerequisites
Hardware

The XSV board already contains an Ethernet PHY [10]. The LXT970A chip on the board can drive both a 10 and a 100 Mbps Ethernet line. But since the master clock on the board is only 20 MHz, we could not implement a 100 Mbps network card (100 Mbps means 10 ns per bit, but the 20 MHz clock gives 50 ns cycles). The chip is a line driver and delivers the data in nibbles1. For this purpose, it generates its own clock of approx. 2.5 MHz. The processing of these nibbles to reassemble the Ethernet frames is usually done by the network card. Thus we wrote the network card for the LEON ourselves in VHDL. The main state machines of the network card, the sender and the receiver process, both run with this slow network clock from the PHY. Therefore, there have to be certain synchronization mechanisms between the PHY and the processor core, for example for processor signals that are too short to be noticed by the slowly running processes, or in the other direction for signals that would be way too long if sent directly to the processor.
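Reassembling bytes from the PHY's nibbles is a small operation. The sketch below assumes the low-order nibble arrives first (an assumption to verify against the LXT970A data sheet):

```c
#include <stdint.h>

/* Combine two 4-bit nibbles into one byte, low-order nibble first. */
static uint8_t assemble_byte(uint8_t first_nibble, uint8_t second_nibble)
{
    return (uint8_t)((first_nibble & 0x0f) | ((second_nibble & 0x0f) << 4));
}
```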

4.3.2

Software

On the side of the operating system, there already exists a complete implementation of a BSD network stack. This stack contains all the usual protocols such as IP, ARP and ICMP on the network layer and below, or TCP and UDP on the transport layer. But we still had to write the network card driver which actually communicates with the hardware of the network card.

4.4
4.4.1

Network Hardware
Architecture

The whole network card consists of basically three parts: the address decoder, which communicates with the CPU, and the sender and receiver processes, which communicate with the line driver chip. Then there are some additional components such as the CRC checker and calculator (it is the same for both the sender and the receiver, but instantiated twice, once for each).
1

one nibble is half a byte (4 bits)
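The CRC function the checker implements is the standard Ethernet CRC-32 (IEEE 802.3). A bit-serial software reference, against which a hardware implementation can be checked, looks like this (this is a generic reference, not our VHDL code):

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise Ethernet CRC-32: reflected polynomial 0xEDB88320,
 * initial value 0xFFFFFFFF, final complement. */
static uint32_t crc32_eth(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```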


There also needs to be some buffer space to save the Ethernet frames which are currently being processed. This space is implemented in the form of two FIFOs, one for each direction. They are both large enough to hold one Ethernet frame. These FIFOs are implemented using block RAMs on the FPGA. They are not put into the external RAM, because in that case it would have been necessary to write a memory controller which acts as an arbiter between the network card and the CPU when accessing memory, and the operating system would have had to hand over some of the memory to the network card, which could then not be used otherwise. The three parts of the network card each run with a different clock. The address decoder uses the CPU clock, whereas the clocks for the sender and the receiver are provided by the line driver. That's why there are some synchronization circuits in the design. The overall architecture of the network card is depicted in figure 4-1.
Figure 4-1: Architecture of the network card. The address decoder connects to LEON's address and data buses (CPU clock domain); FIFO A buffers outgoing and FIFO B incoming frames; the sender and receiver processes, each with a CRC unit, run in the tx/rx clock domains of the PHY (tx_en, rx_dv), with synchronizers between the clock domains and an interrupt line back to the CPU.


4.4.2

Address Decoder

The address decoder attaches to the I/O port of LEON's memory bus. Depending on the address the processor reads or writes, the address decoder performs different operations. A list of all possible actions is shown in table 4-2. The addresses given here address 32 bit words, so the actual byte address is the number given here multiplied by 4. For addresses 3, 4 and 5 it does not matter what value is written; the value is not even used in the address decoder. It is only the fact that they are written that is important. All addresses are relative to the base address of the network card, which we set to the first address of the I/O memory, 0x20000000.

Addr  Binary  Dir  Function  Description
0     000     w    write     Data written to this address is saved in the
                             sender FIFO. The network card does not read
                             this data until it gets the signal to start
                             sending; then the whole FIFO is read and sent.
                             So no more than one full packet can be written
                             to the FIFO before invoking the sender.
1     001     r    read      When a new packet has arrived and has been
                             successfully placed in the FIFO, it can be read
                             from this address. The data contains the full
                             Ethernet header, starting with the destination
                             address. After the actual data there is some
                             stuffing up to the next 4 byte boundary,
                             followed by the CRC of the Ethernet frame and
                             the result from the CRC checker.
2     010     r    status    The status of the two FIFOs, coded as follows:
                             bit 0: FIFO A empty, bit 1: FIFO A full,
                             bit 2: FIFO B empty, bit 3: FIFO B full.
3     011     w    signal    When this address is written, the network card
                             starts sending the data in the sender FIFO.
4     100     w    reset A   Flush FIFO A.
5     101     w    reset B   Flush FIFO B.
6     110     r    #Bytes    After a packet has been received, reading this
                             address yields the number of bytes that packet
                             consists of. This number includes the CRC and
                             the calculated CRC value at the end of the
                             packet.
7     111     -    not used

Table 4-2: Detailed description of the memory locations of the network card
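As an illustration of this register map, a driver-side view in C might look as follows. This is a sketch of ours, not code from the thesis; the accessor and bit-test helper names are hypothetical.

```c
#include <stdint.h>

/* Word-indexed registers of the network card, from table 4-2.
 * The enum names are our invention. */
enum {
    NET_REG_WRITE  = 0, /* 000: push word into sender FIFO      */
    NET_REG_READ   = 1, /* 001: pop word from receiver FIFO     */
    NET_REG_STATUS = 2, /* 010: FIFO status bits                */
    NET_REG_SEND   = 3, /* 011: start sending (value ignored)   */
    NET_REG_RST_A  = 4, /* 100: flush FIFO A (value ignored)    */
    NET_REG_RST_B  = 5, /* 101: flush FIFO B (value ignored)    */
    NET_REG_NBYTES = 6  /* 110: length of last received packet  */
};

#define NET_BASE 0x20000000u /* first address of LEON's I/O area */

/* The decoder addresses 32 bit words, so the byte address is the
 * register number times 4, relative to the base address. */
static inline uintptr_t net_reg_addr(unsigned reg)
{
    return NET_BASE + 4u * reg;
}

/* Status register layout (address 2). */
static inline int fifo_a_empty(uint32_t status) { return  status       & 1; }
static inline int fifo_a_full (uint32_t status) { return (status >> 1) & 1; }
static inline int fifo_b_empty(uint32_t status) { return (status >> 2) & 1; }
static inline int fifo_b_full (uint32_t status) { return (status >> 3) & 1; }
```

In a real driver these addresses would be read and written through volatile pointers; the helpers above only encode the address arithmetic and the status bit layout.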


Chapter 4: Network

4.4.3 FIFOs

The two FIFOs are dual ported FIFOs[6] with a capacity of 511 32 bit words each, so each can hold one full Ethernet frame. This buffer capacity is actually not enough to provide reliable network service in all cases: when frames arrive very closely after each other, the processor cannot read the first frame out of the FIFO before the next one is written into it, and the FIFO reaches its capacity quite fast. Since our implementation does not check whether there is free space in the FIFO, this results in loss of data.

There is still room for improvement here. Either the FIFO has to be enlarged so that it can hold more than two full Ethernet frames, or there has to be more than one buffer for the incoming frames, used in a round robin fashion. The version with the larger FIFO should work, as the network card counts the number of bytes of each frame and the software can ask the card for this number; so it is possible to have more than one frame in the FIFO while the software reads exactly one frame at a time. One problem with this solution arises with frames that have a wrong CRC. As long as there is never more than one frame in the FIFO at a time, the receiver process can simply flush the FIFO if the incoming CRC is not correct. But if previous packets may still be in the FIFO, flushing it no longer gets rid of just the last frame, so the CRC check would always have to be done in software. We did not use larger FIFOs because there were not enough block RAMs available on the FPGA to resize them. The correct implementation would therefore use several separate buffers for the incoming frames: each new frame is written to a different buffer, and when the CRC check shows that the current frame is corrupted, the corresponding buffer can simply be flushed.

The two ports of the FIFOs have different clocks. This poses no difficulty, as the asynchronous FIFOs are read safe, and each FIFO is only written from one side (and only read from the other). So no extra synchronization circuit is necessary.

4.4.4 CRC
The design of the CRC generator was taken from the VHDL XSV Board Interface Projects of the University of Queensland, Australia[16]. It was then improved by Michael Lerjen at the Computer Engineering and Networks Laboratory, ETH Zurich, to process one byte each clock cycle. This is necessary because we run the CRC generator with the slow network clock, whereas in the Queensland project it runs on the fast processor clock. Whenever the signal CRCNewByte is asserted on a rising clock edge, a new byte is processed. In most cases, though, this signal is deasserted between two consecutive bytes, as they are not read from the FIFOs that fast. But in the sender, when the last 32 bit word is being transmitted, the speed has to be increased, and one byte is fed to the generator each cycle (cf. section 4.4.6).
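The value the generator computes is the standard Ethernet CRC-32. A byte-wise software sketch of it (our illustration, not the VHDL design) looks as follows; note that this common software formulation folds the data in directly and needs no trailing null bytes.

```c
#include <stdint.h>
#include <stddef.h>

/* One CRC update step per byte, mirroring one CRCNewByte pulse of the
 * hardware generator. Reflected Ethernet polynomial 0xEDB88320. */
static uint32_t crc32_byte(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0u);
    return crc;
}

/* CRC of a whole frame: preset to all ones, invert at the end, as
 * required for Ethernet. */
uint32_t crc32_frame(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc32_byte(crc, data[i]);
    return ~crc;
}
```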

4.4.5 Receiver

The basic ideas for the receiver (as well as for the sender) process have also been taken from the project at the University of Queensland. The receiver waits until the PHY signals that new data has arrived, then reads the data into the FIFO. At the current state of development, data is written to the FIFO in any case, i.e. there is no check whether the destination address matches the MAC address of the interface; in short, the interface is in promiscuous mode. The state machine of the receiver is shown in figure 4-2.
[Figure 4-2: State machine of the receiver. States: Idle, ReceiveSFD, ReceiveDestMac, ReceiveSrcMac, ReceiveTypeField, ReceiveData, SignalData, SendCRC. Idle is left when rx_dv = 1, ReceiveData when rx_dv = 0; SendCRC asserts intr = 1.]

Most of the states are more or less self-explanatory. The signal rx_dv (receiver data valid) comes from the network PHY and is asserted as long as valid data is being sent to the card. The numbers in the loops of the different states denote the number of clock cycles the state machine remains in these states before proceeding; an internal counter implements this. Since the state machine receives one nibble of data from the PHY in each clock cycle, these numbers are twice the length in bytes of the corresponding field of the Ethernet frame. The state SignalData is used to synchronize data at the end of the frame. Only 32 bit words are saved in the FIFO, but the length of an Ethernet frame is not required to be a multiple of 4 bytes. So when the signal rx_dv is deasserted by the PHY, the remaining bytes are written to the FIFO and the rest of the word is filled with null bytes. While receiving, data is also sent byte-wise to the CRC generator. After all data has been received, the CRC could be checked and, in case it is wrong, the frame could be

thrown away. This is not yet implemented. But the calculated CRC is also written to the FIFO at the end (after the data has been padded with null bytes), so it is also possible to do the CRC check in software. As soon as all data is received, the receiver generates an interrupt to the CPU. This interrupt is synchronized so it can be adjusted to the CPU's needs. Since all data is written to a FIFO, the CPU cannot tell, while reading data from the FIFO, when the end of a frame is reached. For this reason the receiver counts the number of bytes it sends to the FIFO. So when the receiver task is triggered by the interrupt handler, it can ask the receiver for the length of the frame and perform the correct number of reads on the FIFO to remove exactly one whole Ethernet frame. As stated in the section about the FIFOs (cf. section 4.4.3), this design leads to problems as soon as there is more than one frame in the FIFO at a time. While the case where frames arrive too fast can be dealt with in software, it gets harder in the case of a collision. When a collision occurs and the jamming sequence is sent, only part of the frame has been transmitted. So the frame is actually invalid, but parts of it have already been written to the FIFO. It is then up to the software to recognize the corrupted frame and discard it.
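The end-of-frame padding performed in the SignalData state can be sketched in software. This is our illustration; the byte order within the 32 bit FIFO word is an assumption, as the thesis does not specify it.

```c
#include <stdint.h>
#include <stddef.h>

/* Pack a received byte stream into 32 bit FIFO words, padding the
 * last word with null bytes up to the next 4 byte boundary, as the
 * SignalData state does. Returns the number of words written. */
size_t pack_frame_words(const uint8_t *frame, size_t len, uint32_t *words)
{
    size_t nwords = (len + 3) / 4; /* round up to next 4 byte boundary */
    for (size_t w = 0; w < nwords; w++) {
        uint32_t word = 0;
        for (size_t b = 0; b < 4; b++) {
            size_t i = 4 * w + b;
            uint8_t byte = (i < len) ? frame[i] : 0; /* null padding */
            word = (word << 8) | byte; /* first byte in the MSB (assumed) */
        }
        words[w] = word;
    }
    return nwords;
}
```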

4.4.6 Sender
The sender has to take data from the FIFO and send it to the PHY. It gets a signal from the CPU when there is data to be sent. The sender then reads this data, calculates the CRC and sends the data and the CRC to the PHY. The sender is implemented as a state machine, as shown in figure 4-3.
[Figure 4-3: State machine of the sender. States: Idle, SendPreamble, SendData, ShiftNibbles, CalcCRC, SendCRC, Wait, Interrupt. Idle is left when doSendFrame = 1; SendData is left when fifo_empty = 1; the Interrupt state asserts intr = 1.]

Again, most states are self-explanatory. In the state SendData the sender reads a 32 bit word from the FIFO. The first nibble is sent immediately and the state machine proceeds to the state ShiftNibbles, where the second nibble of the current byte is sent. While sending, every byte is also fed to the CRC generator; this is done in the state SendData. At the end, before reading the final CRC, we have to send 4 null bytes to the CRC generator. So after the last 32 bit word has been read from the FIFO (the signal fifo_empty goes to 1), the state changes to CalcCRC. There the rate at which bytes are sent to the CRC generator is doubled, so that 4 additional null bytes can be sent while the last 4 data bytes are still being transmitted. Thus, when the last nibble has been sent, the CRC is already calculated and can be transmitted right away; this is done in the state SendCRC. The Ethernet specification says that after a successful send, the line has to be quiet for at least 12 byte times (24 cycles); the state Wait was introduced to ensure this. At the end, it is possible to generate an interrupt to tell the CPU that the current packet has been sent. This interrupt is generated, but it is not yet forwarded to the CPU. It could share the interrupt line with the receive interrupt, but then the CPU must somehow be able to distinguish the two events. One idea is to map the still unused memory address for this purpose: when the CPU receives the interrupt, this memory location states which of the two events actually happened.
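The nibble handling in SendData and ShiftNibbles can be sketched in software; this is our illustration, not code from the thesis. MII transfers the low-order nibble of each byte first; the byte order within the 32 bit word is an assumption here.

```c
#include <stdint.h>

/* Split a 32 bit FIFO word into the 8 nibbles sent over the 4 bit
 * PHY interface: for each byte, the low nibble first (SendData),
 * then the high nibble (ShiftNibbles). */
void word_to_nibbles(uint32_t word, uint8_t nib[8])
{
    for (int b = 0; b < 4; b++) {
        uint8_t byte = (word >> (24 - 8 * b)) & 0xFF; /* MSB byte first (assumed) */
        nib[2 * b]     = byte & 0xF;        /* SendData: low nibble      */
        nib[2 * b + 1] = (byte >> 4) & 0xF; /* ShiftNibbles: high nibble */
    }
}
```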

4.4.7 Possible Improvements

The most important possible improvements, as stated in the previous sections, are summarized in table 4-3.

CRC
  Problem:  The receiver calculates the CRC of the incoming frame, but it
            delivers the frame regardless of the correctness of the CRC.
  Solution: The hardware checks for a correct CRC and delivers the frame
            only if the CRC matches. Optionally, this could be
            configurable; it was quite helpful for debugging purposes to
            be able to catch all frames, even those with a wrong CRC.

Destination Address
  Problem:  The receiver does not perform address checking on incoming
            frames.
  Solution: Usually, network cards only accept frames which have the
            correct destination address (the address of the card, a
            multicast or the broadcast address). To deliver all frames,
            the card can be set to promiscuous mode.

Collisions
  Problem:  Neither the sender nor the receiver cares about collisions on
            the Ethernet. The sender just sends its frame and signals
            success, and the receiver receives data as long as the PHY
            signals valid data.
  Solution: Collision detection must be implemented in the sender and
            receiver; they are alerted by the PHY when a collision
            occurs. The sender should then wait and retransmit the frame,
            the receiver should delete the already received data.

Memory
  Problem:  The network card does not have enough memory to store more
            than one Ethernet frame. With fast transmissions, this is
            not enough.
  Solution: The internal buffer space of the network card should be
            enlarged. There should be several buffers which are used in
            a circular fashion; the CPU then has time to process one
            frame before that particular buffer space is needed again.

FIFO
  Problem:  The received frames are stored in a FIFO. To read them, the
            driver has to read the same memory address again and again.
            This could be implemented more efficiently if the buffer
            were addressable in a linear way.
  Solution: Instead of using FIFOs as buffer space between the address
            decoder and the sender/receiver, linearly addressable memory
            blocks should be used. Best would be to reserve a certain
            area in the RAM of the processor; this way the card can
            write data directly, and the driver does not have to copy
            each frame again.

Interrupt
  Problem:  The driver does not know when the network card has sent a
            frame, because it is simply not informed about it.
  Solution: The sender should generate an interrupt when a frame has
            been successfully sent. This interrupt tells the driver that
            a send buffer is free for new data. (This also works with
            several send buffers: the driver knows about these buffers
            and fills them all, then waits for an interrupt before
            sending the next frame.)

Table 4-3: Possible hardware improvements

4.5 Software

The software that had to be written consists of a network driver for the network card described in section 4.4 and an application program that runs on the LEON processor. This program reads data from the network and sends it to the audio codec; that is about all there is to do on the LEON processor. But to show that LEON is still alive, we implemented an additional simple task which prints a running clock on the console. Finally, we wrote a program for a Linux box which reads audio data from a file and sends it over the network in the format the audio codec on the FPGA needs.

4.5.1 Streaming Data to LEON

To send data over the Ethernet network to LEON, a streaming protocol is needed. We used the UDP protocol to transport the data to LEON. It is not necessary to use TCP with its handshake and retransmission features, because audio data must be ready at a certain point in time: if a packet is lost, it makes little sense to retransmit it, as it will arrive too late anyway and there will be an audible break in the music. We actually tried to implement the stream using TCP, but the stack is not controllable by the software, meaning it sends packets on its own, which cannot be prevented by any means short of changing the protocol stack. After the handshake has taken place, the stack tries to figure out the maximum window size as fast as possible: when the first packet with data has been delivered, it immediately sends another empty packet with just the ACK bit set. In the current configuration, the network driver is thrown off track by this second packet (cf. section 4.4.3). With UDP, on the other hand, a lost packet is simply omitted when sending to the audio codec; in this case the music just coughs once and then continues with the next data packet.

Even so, we implemented a simple flow control mechanism. The PC sends a packet with data to LEON, which writes it to the audio FIFO; the audio codec then plays this data. As soon as LEON has written all the data to the FIFO, it sends a packet back to the PC confirming the data just written, and the PC sends the next chunk. With this protocol it is simple to prevent LEON from being flooded with packets, and a possible buffer underrun is also handled efficiently. Another possibility would be to make the PC send data at exactly specified times. But then it would be necessary to calculate these intervals and stick to them very tightly: if the interval is just slightly too short, LEON will eventually be forced to drop data to prevent being overrun; with too long an interval, LEON will eventually run out of data, leading to nasty breaks in the music.
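The stop-and-wait handshake can be sketched with POSIX sockets as follows. This is our illustration of the protocol, not the actual sound program; using a single socket for data and confirmations, and the confirmation payload, are simplifications of ours (the thesis's program opens two UDP connections).

```c
#include <arpa/inet.h>
#include <sys/socket.h>

/* PC side of the flow control: send one chunk of audio data to LEON,
 * then block until LEON's confirmation packet arrives before the next
 * chunk may be sent. */
ssize_t send_chunk_and_wait(int sock, const struct sockaddr_in *leon,
                            const void *chunk, size_t len)
{
    char conf[16];
    if (sendto(sock, chunk, len, 0,
               (const struct sockaddr *)leon, sizeof *leon) < 0)
        return -1;
    /* LEON confirms once the whole chunk is in the audio FIFO. */
    return recv(sock, conf, sizeof conf, 0);
}
```

Because the sender blocks on the confirmation, the PC can never get more than one chunk ahead of the audio FIFO, which is exactly the property the flow control needs.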

4.5.2 Driver

The driver basically consists of an interrupt handler, a sender daemon, a receiver daemon and an initialization function. The driver was written according to the RTEMS network manual[9] and thus resembles in large parts the example driver for the Generic 68360. The sender and the receiver are both RTEMS tasks: the sender gets a signal from the stack when it has to send data, the receiver gets a signal from the interrupt handler when new data has arrived. During the whole processing in the network stack, the frame is kept in a special data structure called an mbuf. The structure is quite complicated and contains various unions and other structs, but the important fields are a buffer to hold the whole Ethernet frame, a pointer to the start of the data (which does not have to be the start of the buffer), and a pointer to the start of the IP packet.

Receiver
The receiver first allocates a small pool of mbufs; each mbuf gets external storage for exactly one Ethernet frame. It then waits for the interrupt. Once it has arrived, the receiver first reads from the network card the number of 32 bit words the packet consists of, then the actual data. The data is written to the next available mbuf from the chain. There is a small pitfall here, since the Ethernet header has a size of 14 bytes: if the whole frame were simply copied into the buffer, the IP packet would start on a non-aligned memory address, which causes LEON to trap. That is why the header and the body are copied separately into the mbuf, with a space of two bytes in between. The network card delivers the Ethernet frame with the CRC field and the result of the CRC check included. So it is actually possible to program a packet sniffer which captures all packets, including those with a wrong CRC, as long as the hardware does not do CRC checking. This feature was particularly useful during development of the driver, because the network card in the PC rejects packets with a wrong CRC field, so the software never even sees these packets, and thus they are not visible. The only indication of their existence was the blinking LED on the hub in the test network.


After having copied the packet and marked the header and the body, the mbuf is passed on to the next higher layer of the stack. Finally, the used mbuf must be replaced and the counter advanced to the next mbuf in line to receive a frame.
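The alignment pitfall can be sketched as follows. This is our illustration with a plain buffer; the driver works on mbufs, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN 14 /* destination MAC + source MAC + type field */

/* If the frame were copied to the start of a word-aligned buffer, the
 * IP header would begin at offset 14 and LEON, which traps on
 * unaligned word accesses, could not read its 32 bit fields. Copying
 * header and body separately with a 2 byte gap in between puts the IP
 * header at offset 16, i.e. on a 4 byte boundary. */
void copy_frame_aligned(uint8_t *buf, const uint8_t *frame, size_t len)
{
    memcpy(buf, frame, ETH_HDR_LEN);                       /* header at 0..13  */
    memcpy(buf + ETH_HDR_LEN + 2, frame + ETH_HDR_LEN,     /* 2 byte gap       */
           len - ETH_HDR_LEN);                             /* IP packet at 16  */
}
```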

Sender
The sender waits for the stack to pass mbufs containing frames to send. Once the signal from the next layer has arrived, it first clears the sender FIFO. The first mbuf of the packet is dequeued and interpreted: if it is the first mbuf (the flag M_PKTHDR is set), the length of the whole frame is saved for later use. The data is then copied into a send buffer, after checking that there will not be a buffer overflow. Finally, the mbuf is freed. When all mbufs have been read, the accumulated length of the frame is checked against the advertised length from the first mbuf. There is again a minor pitfall here: it is possible that the higher network layers produce mbufs with no data and the length field set to zero, but with the implementation here this should not pose a problem. Now it is time to adjust the length. First we have to ensure that the minimum length requirement of an Ethernet frame is fulfilled; and since we send words of 32 bits while the length information is in bytes, the length has to be converted accordingly. As soon as all the data has been written to the FIFO, the network card is signaled to begin sending. The sender task then has to update statistics and reset certain variables before it can begin waiting for the next frame to send. Actually, at that point it should wait for the interrupt the card generates when the frame has really been sent.
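The length adjustment can be sketched like this (our illustration; the minimum of 60 bytes for the frame without CRC is the usual Ethernet figure, as the thesis does not state the driver's exact constant):

```c
#include <stddef.h>

#define ETH_MIN_LEN 60u /* minimum Ethernet frame length without CRC (assumed) */

/* Pad short frames to the minimum Ethernet length, then convert the
 * byte count to the number of 32 bit words written to the FIFO,
 * rounding up to the next word boundary. */
size_t frame_words(size_t len_bytes)
{
    if (len_bytes < ETH_MIN_LEN)
        len_bytes = ETH_MIN_LEN;  /* enforce minimum frame length */
    return (len_bytes + 3) / 4;   /* bytes -> 32 bit words, rounded up */
}
```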

4.5.3 Application on LEON

The application program on LEON is quite simple. It just reads data from the Ethernet and then writes the sound samples to the FIFO. The sound transport protocol runs over UDP. The packet contains just raw data, as produced by sox (and possibly the ADPCM coder). LEON takes these samples and sends them unaltered to the FIFO of the audio codec. The only difficulty here is to use the correct byte order: the host computer is a PC running Linux and therefore little-endian, while LEON is a SPARC implementation, which is big-endian.

In spite of our flow control mechanism, there are still some packet losses. To improve the program and recover after a packet loss, the program starts a timer just before calling the blocking receive function recvfrom(). When no packet arrives within a certain time period, the timer function flushes the network FIFOs, resends the confirmation packet and resets the timer to a longer period (the interval is actually doubled). With this modification it is usually possible to recover from a packet loss. Of course the data in the lost packet is also lost, but with audio streaming this is not really a problem, since the played music just coughs once before going on.

Even when sending ADPCM data, a packet loss should not pose a problem in theory. After such a loss, the signal is off the correct values by a constant. This should make no difference in the played sound as long as the signal does not saturate. When it finally saturates, it should produce some strange noise, but at the same time the offset by which it is off the correct signal diminishes. So when the signal leaves saturation again, it is closer to the original than before. That means it is actually self-adjusting after a packet loss, with only minor implications on the sound quality. In practice, playing ADPCM produced some strange results: after a packet loss, the sound sometimes grew louder, sometimes quieter, and sometimes only one channel was heard afterwards. The problem here is that the network card does not work correctly on a packet loss.

To show that LEON is still alive, we wrote a task showing a running clock on the standard output. In between, it checks standard input for user commands. That way it is possible to easily add arbitrary user controlled functionality to the program. We implemented some functions for debugging (such as show FIFO status, flush FIFOs, show network card statistics). A volume control was also implemented: pressing + increases the volume, pressing - decreases it. The volume control is implemented by bitwise shifting the samples to the left (volume is positive) or to the right (volume is negative). So the amplitude of the samples is multiplied or divided by 2 with each level of volume. This volume control only works for PCM data.
When sending ADPCM, the volume has to be set to 0, so that the samples are passed through unchanged. If the volume is not 0 when sending ADPCM data, the result is only loud noise.
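The shift-based volume control can be sketched as follows; this is our illustration, operating on single 16 bit samples, and saturation is not handled, as in the original.

```c
#include <stdint.h>

/* Each volume level shifts the sample by one bit, i.e. multiplies
 * (positive volume) or divides (negative volume) the amplitude by 2.
 * Volume 0 passes the sample through unchanged, which is why ADPCM
 * data must be played at volume 0. */
int16_t apply_volume(int16_t sample, int volume)
{
    if (volume >= 0)
        return (int16_t)(sample << volume);  /* louder: multiply by 2^v  */
    return (int16_t)(sample >> -volume);     /* quieter: divide by 2^-v  */
}
```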

4.5.4 Application on PC

The program on the PC, called sound, has to stream the data to LEON. It opens two UDP connections, one to send data and one to receive confirmations from LEON. It then reads a chunk of data, sends it to LEON and waits for the confirmation.

For the two audio codecs we have to create two different formats of audio data, PCM and ADPCM. The sound application does not distinguish between the two formats, so the difference has to be made beforehand. PCM data consists of plain audio samples, 16 bit signed, stereo, with a sampling rate of 39063 Hz (this sampling rate is a fraction of the processor clock of 20 MHz: 39.0625 kHz * 2^9 = 20 MHz). But to create ADPCM data we need raw data which can then be converted using the ADPCM library[11]. Thus the first task is to create raw data. This is something the sound program sox[17] can do: it reads wav files and converts them to raw data while at the same time adjusting the sample rate, including the application of the necessary lowpass filter. Since the function lowp of sox is only a lowpass of first order, the edge frequency is chosen at 85% of the new bandwidth:

sox -t wav infile.wav -r 39063 -s -w -t raw outfile.raw lowp 16600

Now we are able to play wav files on LEON. On the Internet there are some radio stations that stream mp3 music2. To be able to play these streams we first had to be able to send mp3 files to LEON. There are basically two possible solutions: send mp3 data to LEON and decode it in hardware, or decode it in software on the host and send raw data. Playing raw data was already implemented, so we looked for a program to convert mp3 files to raw data. The program mpg123 can either write wav files:

mpg123 -w outfile.wav infile.mp3

or, to directly stream data through, it can also write raw data:

mpg123 -s infile.mp3 | sox -r 44100 -s -c 2 -w -t raw -r 39063 -s -c 1 -w -t raw - lowp 16602 | sound

The final step is to load the stream from the Internet and pipe it into mpg123. For this we misused the text based web browser lynx:

lynx -useragent=xmms/1.2.4 -source http://IP:PORT | mpg123 -y -s - | sox -r 44100 -s -c 2 -w -t raw -r 39063 -s -c 1 -w -t raw - lowp 16602 | sound - 1460

2 http://www.shoutcast.com


5 Implementation of a dynamically reconfigurable System
In this chapter we describe the implementation of the reconfigurable system we have developed in our thesis. We have targeted a specific application which demonstrates a partially reconfigurable system and the problems associated with it. Figure 5-1 illustrates this application.

[Figure 5-1: Application of a dynamically reconfigurable system. The CPU sends audio format 1 to sound core 1, which drives the DAC; after partial dynamic reconfiguration, the same CPU sends audio format 2 to sound core 2, which drives the same DAC.]
We have a CPU core and a sound core on an FPGA. The sound core is connected to the CPU and receives audio data in a specific format. It then decodes the audio data and sends it to the DAC on the development board. The sound core can only decode and play one specific audio format; if we need to play another format, we replace the sound core using partial dynamic reconfiguration. One key point is that the CPU is not affected by the partial reconfiguration and continues to run. To realize dynamic reconfiguration, we have used flow 4 described in section 2.4.4. We have implemented both versions of flow 4; to distinguish the two, we named the flow from figure 2-6 the Dynamic Routing Flow and the flow from figure 2-5 the Direct Copy Flow.

5.1 Partitioning

Our reconfigurable system consists of a static and a dynamic part (see Figure 5-2).

[Figure 5-2: Partitioning of our reconfigurable system: a static part with LEON and the Interface, and a dynamic part with the Virtual Component. LEON connects via the address and data bus to the Interface, which connects through the VC signals to the Virtual Component; the I/O pads belong to the static part.]

LEON: The LEON SPARC processor is the central component. Refer to chapters 3 and 4.

Interface: The interface is the key part for a successful reconfiguration. It has to be designed and physically constrained well in order to allow the replacement of the Virtual Components.

Virtual Component: The Virtual Component is the only reconfigurable unit. It temporarily provides a certain function to the system, like playing a specific audio format. It can be replaced on demand with another Virtual Component.

5.2 Interface

We had three choices for attaching an extra component to the LEON processor (cf. figure 3-2):

Connect to AMBA AHB bus: The LEON processor uses the AMBA AHB bus for high-speed data transfers. There are currently two slaves attached to the AHB bus: the memory controller and the APB bridge.

Connect to AMBA APB bus: The AMBA APB bus is connected via the AHB/APB bridge to the AHB bus. Data transfers are slower than with the AHB bus but less complex.


Connect to memory bus: The LEON processor supports a special address space for memory mapped I/O. Memory mapped I/O devices can be attached to the address and data bus and accessed the same way as memory.

Writing AHB and APB peripherals is quite complex, so we decided to use the simplest option, memory mapped I/O. To avoid contention, we had to insert tristate buffers to and from the data bus. An address controller sets the control signals depending on the specified address. To loosen the dependency of the Virtual Component on the processor, we inserted two FIFO queues (see Figure 5-3).
[Figure 5-3: Detailed view of the interface between the processor, external IOBs and the Virtual Component. The data bus feeds FIFO a (32 bit VC Data In) and FIFO b (32 bit VC Data Out) with FIFO handshake signals; 8 bit Control and 8 bit Status buses bypass the FIFOs; the audio signals lead to the I/O pads; all signals to the Virtual Component pass through CLB Macros.]
We wanted to design the interface to the Virtual Component to be as flexible as possible. This means we should also allow applications other than just playing audio; one could think of a general computation unit which receives data from the processor, manipulates it and sends it back. The VC Data In and VC Data Out buses are intended for this purpose and lead through the FIFOs. Another pair of buses provides instant access to the Virtual Component (Control) and the possibility to read back a status (Status) without passing through the FIFOs. For a more detailed explanation of the signals to the Virtual Component, refer to section 5.3.1. As explained in section 2.4, the points where the Virtual Component is connected to the interface must be fixed and well-defined, so that the JBits manipulation program can reconnect the replaced Virtual Component. For this reason, all signals to and from the Virtual Component lead through the CLB Macros. A CLB Macro does not add any logic to the system; it only lets the signals pass through, but it can be physically placed at a certain location on the FPGA. In section 5.4 we describe how we constrained the CLB Macros.

5.3 Virtual Components

In our application, a Virtual Component has to decode and play audio in a specific format. We have designed two of them, one playing PCM data and the other playing ADPCM. With these two replaceable cores, we show the mechanism of partial dynamic reconfiguration. There are only two components by now, but once you can replace the first one with the second, you can also replace it with a third or a fourth core. As long as all cores use the same interface and behave similarly concerning signal timing, Virtual Components could implement any desired function.

5.3.1 VC Interface

The Virtual Components introduce a new level in the VHDL hierarchy. Figure 5-4 shows the schematic symbol of the entity, and table 5-1 describes the input and output signals.
[Figure 5-4: Schematic symbol of the Virtual Component entity; its signals are listed in table 5-1.]

Signal   Type    Width  Description
datain   input   32     Data input from FIFO a
dataout  output  32     Data output to FIFO b
regin    input   8      Control input (direct)
regout   output  8      Status output (direct)
empty    input   1      input FIFO empty flag
full     input   1      output FIFO full flag
rd_en    output  1      input FIFO read enable
wr_en    output  1      output FIFO write enable
mclk     output  1      master clock (stereo codec)
lrck     output  1      left/right channel (stereo codec)
sclk     output  1      audio serial data clock (stereo codec)
sdin     output  1      audio serial data (stereo codec)
reset    input   1      reset signal
clk      input   1      clock signal

Table 5-1: Input and output signals of the Virtual Component entity

Our two Virtual Components are read-only cores; they do not need to return any data. Therefore the dataout, regout, full and wr_en signals are not connected inside the component.

5.3.2 PCM Player

The PCM Player is a simple yet well sounding audio component. PCM means that the incoming data consists of uncompressed raw audio samples, so all that has to be done is to deliver this data in the right form to the stereo codec. On our XSV800 board there is an AK4520A 20 bit Stereo ADC & DAC1. Only four signals are needed in our case:

MCLK: Master Clock Input
LRCK: Left/Right Channel Input
SCLK: Audio Serial Data Clock
SDTI: Audio Serial Data Input

[Figure 5-5: Timing for the AK4520A stereo codec, taken from the AK4520A datasheet]

Figure 5-5 shows the signal timing. LRCK and SCLK are divisors of MCLK. The relation is:

f_MCLK = 256 f_S
f_LRCK = f_S
f_SCLK = 64 f_S

where f_S is the sampling frequency. To generate these signals, we let a counter run with the system clock (20 MHz). We then assign:

mclk <= count(0);
sclk <= count(2);
lrck <= count(8);

This results in a sampling frequency of approximately 39.063 kHz. We could have achieved a more standard sampling rate (e.g. 44.1 kHz) by applying an external oscillator. But since we can pre-convert any audio stream to this rate, there is no need for it by now.

To produce the serial audio output, we use the same counter to encode five states, described in table 5-2. Whenever a certain counter value is reached, the corresponding action is triggered.

1 Analog Digital Converter, Digital Analog Converter.
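The resulting frequencies can be checked with a small sketch of ours: bit n of a free-running counter toggles every 2^n input cycles, so one full period of that bit takes 2^(n+1) cycles.

```c
/* Frequency of counter bit n for a counter clocked at f_clk. With
 * f_clk = 20 MHz this gives mclk = count(0) = 10 MHz,
 * sclk = count(2) = 2.5 MHz and lrck = count(8) = 39.0625 kHz,
 * matching f_MCLK = 256 f_S and f_SCLK = 64 f_S. */
double bit_freq(double f_clk, int bit)
{
    return f_clk / (double)(1u << (bit + 1));
}
```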

Chapter 5: Implementation of a dynamic reconfigurable System

Counter (binary)   Procedure
000000000          Latch right sample from FIFO into shift register: the 16 bit shift register is loaded with the lower 16 bits of the 32 bit input word.
x0xxxx001          Output serial audio bit: for the left and the right channel, 16 times each, output the most significant bit of the shift register.
x0xxxx010          Shift left shift register: after the output, shift the register left to process the next bit.
100000000          Latch left sample from FIFO into shift register: the 16 bit shift register is loaded with the higher 16 bits of the 32 bit input word.
110000000          Generate FIFO read pulse: after left and right sample are read from the input word, generate a read pulse to tell the FIFO to output the next word.

Table 5-2: Sequence to produce serial audio data
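The don't-care patterns of Table 5-2 amount to a mask/value comparison on the 9 bit counter: the bits fixed by a pattern form the mask, e.g. for x0xxxx001 only bit 7 and bits 2..0 are compared. A Java sketch (names are our own) decodes which action a counter value triggers:

```java
public class SerializerDecode {
    // Bits fixed by the x0xxxx... patterns of Table 5-2: bit 7 and bits 2..0.
    static final int MASK = 0b010000111;

    // Returns the action a 9 bit counter value triggers ('x' bits masked out).
    static String action(int count) {
        if (count == 0b000000000)          return "latch right sample";
        if ((count & MASK) == 0b000000001) return "output bit";
        if ((count & MASK) == 0b000000010) return "shift left";
        if (count == 0b100000000)          return "latch left sample";
        if (count == 0b110000000)          return "fifo read pulse";
        return "idle";
    }
}
```

With bit 7 forced to 0 and bits 6..3 free, the pattern x0xxxx001 matches exactly 16 counter values per LRCK half-period, one per output bit of a 16 bit sample.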

5.3.3

ADPCM Player

The ADPCM Player is more complex than the PCM Player. Adaptive Differential Pulse Code Modulation (ADPCM) codecs are waveform codecs which, instead of quantizing the speech signal directly as PCM codecs do, quantize the difference between the speech signal and a prediction of it. If the prediction is accurate, the difference between the real and the predicted speech samples has a lower variance than the real speech samples and can be quantized accurately with fewer bits than would be needed for the original samples. The ADPCM Player can be divided into three stages (see Figure 5-6):
Figure 5-6: ADPCM Player Stages (split -> decode left/right -> serialize)

1. Read a 32 bit word from the FIFO and split it up into 4 bit nibbles.
2. Decode ADPCM for the left and the right channel.
3. Serialize left and right PCM data to the Stereo-Codec and synchronize the left and right channel decoders.


5.3. Virtual Components

Splitter
The 32 bit word from the FIFO has the format shown in Table 5-3. n1 (l) means that the first nibble is for the left channel.

    Bits:    31..28  27..24  23..20  19..16  15..12  11..8   7..4    3..0
    Nibble:  n1 (l)  n2 (r)  n3 (l)  n4 (r)  n5 (l)  n6 (r)  n7 (l)  n8 (r)

Table 5-3: ADPCM word format

When there are words in the FIFO (indicated by the inactive empty signal), the left decoder is enabled for reading. On its request we load a word from the FIFO into a shift register whose four highest bits (31 down to 28) are directly connected to both the left and the right decoder. Now the left decoder reads the first nibble and we enable the right decoder. On the right decoder's request, we shift the register left by four bits so that the second nibble can be read. This continues in the same way until all eight nibbles are read (see the state diagram in Figure 5-7). Note that this does not synchronize the decoders. We presume that the decoders send their read requests alternately (see next section).
Figure 5-7: Finite State Machine to feed two ADPCM decoders with 4 bit nibbles
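Functionally, the splitter hands the eight nibbles of a word out in this order; a behavioural sketch in Java (names are ours; the hardware realizes this with the shared shift register and the FSM described above):

```java
public class NibbleSplitter {
    // Splits a 32 bit ADPCM word into eight 4 bit nibbles, most
    // significant nibble first; odd-numbered nibbles (n1, n3, n5, n7)
    // go to the left channel, even-numbered ones to the right channel.
    static int[] leftNibbles(int word) {
        int[] l = new int[4];
        for (int i = 0; i < 4; i++)
            l[i] = (word >>> (28 - 8 * i)) & 0xF; // n1, n3, n5, n7
        return l;
    }

    static int[] rightNibbles(int word) {
        int[] r = new int[4];
        for (int i = 0; i < 4; i++)
            r[i] = (word >>> (24 - 8 * i)) & 0xF; // n2, n4, n6, n8
        return r;
    }
}
```

For the word 0x12345678, the left channel receives the nibbles 1, 3, 5, 7 and the right channel 2, 4, 6, 8.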

Serializer
The Serializer is essentially the same as the PCM Player (see Section 5.3.2). The difference is that the left and right channel samples are not read from the FIFO but directly from the two decoders. Since the decoders have a FIFO interface (see next section), the Serializer has to provide a full flag to tell the decoders that they may output the decoded samples. This signal is used to synchronize the two decoders. Both full signals (for the left and the right decoder) are high most of the time. The two decoders have decoded a sample (they are much faster) and are waiting for the full signal to become low. When the Serializer wants to load its shift register with the left sample, it briefly deactivates the full signal for this decoder. Half a period later the full signal for the right decoder is deactivated, resulting in an exact alternation of the two decoders.



ADPCM Decoder
In the beginning we intended to insert an IP-core ADPCM decoder. But the only freely available ones were either too simple (low quality) or too complex. So we designed one ourselves. Our ADPCM decoder is a hardware realization of the software decoder from Stichting Mathematisch Centrum in Amsterdam [11] and is fully compatible with their software encoder. The implemented standard is called Intel/DVI ADPCM, which is a 16 bit PCM to 4 bit ADPCM coder and decoder. For details on the used algorithm refer to the C source code. The code is freely available (see appendix D for the license).

The following five steps are needed to convert ADPCM to 16 bit PCM:

1. Get new nibble (delta).
2. Update index: index = index + indexTable[delta]
3. Compute difference (vpdiff): vpdiff = (|delta| + 1/2) * step / 4
4. Compute new predicted value (pred): pred = pred + sign(delta) * vpdiff
5. Update step value: step = stepTable[index]

Since |delta| has only three bits, the difference vpdiff (in step 3) is easily computed with one assignment and three conditional summations:

1. vpdiff = step >> 3
2. if (delta(0) = 1) vpdiff = vpdiff + (step >> 2)
3. if (delta(1) = 1) vpdiff = vpdiff + (step >> 1)
4. if (delta(2) = 1) vpdiff = vpdiff + step

With one Adder/Subtracter and five special purpose registers (see Figure 5-9), we can process one sample in five cycles. A Finite State Machine sets all the control signals needed in the data path. Figure 5-8 shows an abstraction of this state machine. In normal operation, it cycles through states one to five. Table 5-4 describes these states. Our ADPCM decoder was designed to handle a FIFO interface, both for reading the delta nibbles and for writing out the PCM samples. There are four FIFO handshake signals to control the interaction between the decoder and the input and the output FIFO respectively. The empty and the full signal inform the decoder whether it can


State  Procedure                    Control Signals
1      delta <= fifo                deltaload <= 1
       pred <= pred +/- vpdiff      predload <= 1
                                    add/sub <= not sign
                                    amux <= "01"
                                    bmux <= 1
                                    vmux <= 1

2      index += indexTable          indload <= 1
       vpdiff <= step >> 3          vpload <= 1
       output enable                stepshift <= 1
                                    amux <= "00"
                                    bmux <= 0
                                    vmux <= 0
                                    output_en <= 1

3      vpdiff += step >> 2          vpload <= 1
                                    stepshift <= 1
                                    deltashift <= 1
                                    amux <= "1x"
                                    bmux <= 1
                                    vmux <= 1

4      vpdiff += step >> 1          vpload <= 1
                                    stepshift <= 1
                                    deltashift <= 1
                                    amux <= "1x"
                                    bmux <= 1
                                    vmux <= 1

5      vpdiff += step               vpload <= 1
       step = stepTable             stepload <= 1
       read request                 amux <= "1x"
                                    bmux <= 1
                                    vmux <= 1
                                    input_en <= 1

Table 5-4: ADPCM Control Sequence (in pseudo VHDL)



Figure 5-8: Control Path State Diagram with Initial States (Init, 5i, 1i and 2i), Normal Run States (1 to 5) and Wait States (empty and full).
read from the input FIFO (if it is not empty) or write to the output FIFO (if it is not full). With the input_en signal the decoder reads a new nibble from the FIFO, and with the output_en signal it writes the computed sample to the output FIFO. As can be seen from Figure 5-8, the decoder checks the empty signal before entering state five (where it sends the read request), and if the input FIFO is empty it enters the empty wait state. Similarly, when the output FIFO is full the decoder enters the full wait state. Although the decoder can handle a FIFO interface, it is directly connected to other components without a FIFO in our design. This is possible as long as the handshake signals are handled. The states init, 5i, 1i and 2i are initial states. States 5i, 1i and 2i have the same functionality as states 5, 1 and 2 except that they don't process values from a previous cycle.
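Since the hardware realizes the Stichting Mathematisch Centrum reference algorithm, the five steps can be cross-checked in software. The following Java sketch (class and method names are ours) mirrors the Intel/DVI decoding loop with the standard index and step tables of the reference C code; one call consumes one 4 bit nibble and returns the next 16 bit sample:

```java
public class AdpcmDecoder {
    // Standard Intel/DVI ADPCM tables from the reference implementation.
    // INDEX_TABLE is indexed with the 3 bit magnitude (values -1 to 8).
    static final int[] INDEX_TABLE = {-1, -1, -1, -1, 2, 4, 6, 8};
    static final int[] STEP_TABLE = {
        7, 8, 9, 10, 11, 12, 13, 14, 16, 17,
        19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
        50, 55, 60, 66, 73, 80, 88, 97, 107, 118,
        130, 143, 157, 173, 190, 209, 230, 253, 279, 307,
        337, 371, 408, 449, 494, 544, 598, 658, 724, 796,
        876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
        2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358,
        5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
        15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767
    };

    private int pred = 0;   // predicted sample
    private int index = 0;  // index into STEP_TABLE

    int decode(int delta) {
        int step = STEP_TABLE[index];
        // vpdiff = (|delta| + 1/2) * step / 4, built from shifts
        int vpdiff = step >> 3;
        if ((delta & 1) != 0) vpdiff += step >> 2;
        if ((delta & 2) != 0) vpdiff += step >> 1;
        if ((delta & 4) != 0) vpdiff += step;
        // bit 3 of the nibble is the sign
        if ((delta & 8) != 0) pred -= vpdiff; else pred += vpdiff;
        // saturate to the 16 bit two's complement range
        if (pred > 32767) pred = 32767;
        else if (pred < -32768) pred = -32768;
        // update the step table index, clamped to 0..88
        index += INDEX_TABLE[delta & 7];
        if (index < 0) index = 0;
        else if (index > 88) index = 88;
        return pred;
    }
}
```

Unlike the hardware, the sketch works directly in two's complement; the offset-binary trick of the section "Saturation and Number Format" is only needed because of the unsigned adder.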

Adder/Subtracter
The Adder/Subtracter in the decoder is an IP module from LogiCORE. It has two 16 bit unsigned inputs and a 16 bit unsigned output. The add/sub signal controls whether an addition or a subtraction is made. The overflow signal goes high to indicate an overflow if the result exceeds the 16 bit bounds. In case of a subtraction, overflow is normally high and goes low if an underflow occurs.

Delta Shift Register


This shift register is loaded with the 4 bit ADPCM nibble. The first bit, the sign bit, is not shifted and remains in the register until the next loading. The other three bits, the magnitude, are available to the Index Table until the signal deltashift goes high for the first time. Then the three magnitude bits shift right. The delta0 signal is needed to control amux in the computation of vpdiff in states three to five. Before the first shift it has the value of delta(0), after the first shift the value of delta(1), and finally the value of delta(2). If delta0 is 0 in these states, the left multiplexer (amux) outputs a null vector, making the summation ineffective.


Figure 5-9: ADPCM Decoder Architecture (Data Path)



Step Shift Register


This register also has a special architecture for the computation of vpdiff in states three to five. The register has a 16 bit input and output, but the internal width is 19 bits. On stepload the 16 bit output from the Step Table is loaded into the lower end of the shift register (see Table 5-5). The output consists of the 16 higher bits of the shift register. So the first value which appears on the output after loading the register is step >> 3. On each stepshift the register shifts left by one bit.

    Operation   Register contents (19 bits)   Output (16 higher bits)
    stepload    000 step                      step >> 3
    stepshift   00 step 0                     step >> 2
    stepshift   0 step 00                     step >> 1
    stepshift   step 000                      step

Table 5-5: Step Shift Register States
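A behavioural model of this register fits in a handful of lines (Java sketch, names are ours; the 19 bit width guarantees that three left shifts never drop a bit of a 16 bit step value):

```java
public class StepShiftRegister {
    private int reg; // 19 bits in use

    // stepload: put the 16 bit step value into the lower end.
    void load(int step) { reg = step & 0xFFFF; }

    // stepshift: shift left by one bit, staying within 19 bits.
    void shift() { reg = (reg << 1) & 0x7FFFF; }

    // Output is formed by the 16 higher bits (18 downto 3) of the register.
    int output() { return (reg >> 3) & 0xFFFF; }
}
```

After load(step) the output reads step >> 3; each shift() then yields step >> 2, step >> 1 and finally step, exactly the sequence the vpdiff computation consumes in states two to five.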

Index Table and Step Table


The Index Table and the Step Table are combinational ROM look-up tables. They simply output the defined value for an applied address. The Step Table only holds 89 values although its input width is seven bits; the remaining values are zero. The output format of the Step Table is 16 bit unsigned, the one of the Index Table is 8 bit 2's complement (values from -1 to 8).

Saturation and Number Format


We have to pay attention to two special summations:

    index = index + indexTable
    pred = pred +/- vpdiff

Let us consider the first summation. We have to restrict index to a range of 0 to 88, so we must saturate if the summation exceeds this bound. Another problem is that the Index Table can have a negative output (-1) which we want to add to the unsigned index. The summation with -1 actually isn't a problem if we only take 8 bits from the result. To inhibit going below zero and above 88 we compare the result with 88. If it is greater than this value, there was either an overflow or an underflow. The highest overflow is 88 + 8 = 96. An underflow appears when 0 + (-1) = 255 (unsigned result). So we can inspect the eighth bit of the result: if it is 0 there was an overflow and we set index to 88, if it is 1 we set index to zero. The second summation, the computation of the output sample, also needs a special saturation arithmetic. The sample output is a 16 bit 2's complement value with a range from -32768 to +32767. We can't use this format for the summation because of the unsigned adder/subtracter. We shifted the whole scale up by 32768, mapping

the lowest value to zero and the highest to 65535. We can thus detect an overflow or an underflow by inspecting overflow and add/sub. If both are high, there was an overflow and we saturate at 65535; if both are low, we saturate at zero. The conversion back to the 2's complement format is simply done by inverting the highest bit.
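Both saturations can be checked in software; a Java sketch (names are ours, mirroring the bit tricks described above):

```java
public class Saturate {
    // Index saturation: 8 bit result, valid range 0..88. If the result
    // exceeds 88, bit 7 distinguishes overflow (0) from underflow (1).
    static int clampIndex(int sum) {
        int r = sum & 0xFF;                      // unsigned 8 bit result
        if (r > 88) return (r & 0x80) == 0 ? 88 : 0;
        return r;
    }

    // Conversion from the offset-binary scale (0..65535) used by the
    // unsigned adder back to 16 bit 2's complement: invert the MSB.
    static int toTwosComplement(int offsetBinary) {
        return ((offsetBinary ^ 0x8000) << 16) >> 16; // invert MSB, sign-extend
    }
}
```

clampIndex(96) saturates to 88 and clampIndex(255), i.e. 0 + (-1) read as unsigned, to 0; toTwosComplement(0) gives -32768 and toTwosComplement(65535) gives 32767.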

5.4

Constraining the Design

Constraining the design is essential for partial reconfiguration. If we want to cut a block from one design and paste it into another one, we have to make sure that nothing in the static part changes (see section 2.4.4). This implies that the static and the dynamic part have to be locally separated. The separation does not only concern the placement of logic blocks but also the routing. The routes of the static part must not pass the dynamic area, otherwise they may be disconnected after the reconfiguration. We applied three different steps to constrain our design: floorplanning, guided routing and the insertion of CLB macros. The following sections describe these steps. Figure 5-10 shows a schematic of the result.
Figure 5-10: Constraining the Design: Floorplanning, Guided Routing and CLB Macros (for the Direct Copy Flow). LEON and the Interface form the static part; the Virtual Component lies in the reconfigurable area, bordered by the routing blocks of CLB macros.
5.4.1

Floorplanning

With floorplanning we have defined a static and a reconfigurable area. We have used Xilinx's Floorplanner and the UCF flow (2). This allowed us to write all settings into a UCF file. See appendix B for the UCF file of our design.

Area constraints: If the synthesis tool preserves the VHDL hierarchy, you can apply an area constraint to any entity within the design. We did this for the CPU, the Interface and for the Virtual Component.

Manual placement: After verifying the effects of floorplanning, we saw that some components didn't obey the area constraints. These components were all tristate buffers (TBUFs), and it looked like they were placed arbitrarily over the FPGA and also

(2) UCF: User Constraints File.


in the reconfigurable area. The explanation of this behavior is that all TBUFs driving the same net have to be horizontally aligned (on the same FPGA row). We utilize TBUFs to connect the external RAM and our FIFO interface to LEON's databus. What we had to do then was to find all TBUFs connected to the databus, including those of the memory controller inside LEON. Then we had to place them manually to fulfill both the horizontal alignment and our floorplanning ideas.

5.4.2

Guided Routing

We have successfully constrained the placement of logic blocks with floorplanning. Unfortunately there aren't any similar methods to constrain the routing. In particular, current tools offer no constraint that inhibits a route of the static part from passing through our reconfigurable area. If you have a net connecting a component in the static part on the left side with an IOB on the right side, the route may cause problems. We call the routes which could possibly pass the reconfigurable area Disturbing Lines. One method to avoid Disturbing Lines is to arrange the IOB locations: if all pads of the static part and the static part itself are in the same half, no routes will pass through the other half. This method isn't applicable in our case, since we can't freely choose IOB locations due to the fixed wiring on the development board. We have evaluated four potential solutions for the Disturbing Line problem:

Dynamic routing: With the dynamic reconfiguration tool JBits, we first detect the Disturbing Line and remember its start and end points. Before we replace the Virtual Component, we unroute the Disturbing Line. After the replacement, we dynamically reconnect the line. But this method causes other problems. Since the Disturbing Line is also in the static part, the partial reconfiguration will also undesirably affect this part. Another problem is that a disconnection of a static route conflicts with the idea of dynamic reconfiguration: the static part should continue running during the reconfiguration. The third problem is that the routing algorithm of JBits is much simpler than the one of the conventional tools; for instance no long lines are used. We have rejected this idea because of these problems.

Manual routing: Once we have implemented the design, we open it with the FPGA Editor and manually reroute all Disturbing Lines. We saw that sometimes it suffices to reroute a net with the autorouter. With luck, the new route doesn't pass through the reconfigurable area.
Manual routing is extremely time-consuming: you have to assemble the route from short segments. If there were only one Disturbing Line this method would be an option, but not for more. We desire a method which can be integrated into an automatic flow.

Anti core: An anti core acts as a placeholder for the later dynamically inserted Virtual Component. One can instantiate the anti core in VHDL as a black box with the

same connections as the real core. The anti core occupies most of the resources within a CLB, so that no other logic elements can be placed there. More information on anti cores can be found in [14]. It is not clear how many routing resources an anti core reserves for itself and whether that is enough to keep other routes out of this area. Creating an anti core actually requires an existing JBits core. But our Virtual Components are not JBits cores: they are written in VHDL and synthesized and implemented with conventional tools. There may exist a workaround to produce an anti core out of a netlist, but we didn't follow this way.

Guided routing: The idea of guided routing is illustrated in Figure 5-11. We can not constrain a route itself, but we can constrain the components connected to that route. We thus insert a passthrough CLB into each Disturbing Line and place it in a reasonable area. A passthrough CLB is actually only a lookup table with one input and without any logic function; it simply passes the signal through. One can route up to four lines through one CLB (two slices with two lookup tables each). This CLB can be placed with the usual area constraints in the UCF file. We have combined these CLBs into three routing blocks located around the reconfigurable area (see Figure 5-10). In section 5.4.3 we describe how to build a passthrough CLB macro. To decide where to place the passthrough CLBs we still have to open the initial design in the FPGA Editor and inspect the Disturbing Lines. But we have to do this only once, and not after every new implementation as in manual routing. The CLBs add a short delay to the route. This is negligible for most designs, but should be taken into account for high-speed designs.

Figure 5-11: Guided Routing. (a) Initial Design with a Disturbing Line; (b) End Result after the Insertion of a passthrough CLB.



5.4.3

CLB Macros

To connect the Virtual Component with the static interface, we have also inserted passthrough CLBs. This is important for two main reasons:

1. With location constraints, we can place the passthrough CLBs at a fixed location. This allows the JBits manipulation program to find them and (re-)connect the replaced Virtual Component.

2. With passthrough CLBs we can split up a connection to the Virtual Component into a static and a dynamic part. Only the dynamic part (from the passthrough CLB to the Virtual Component) will be affected by the reconfiguration. The following situation explains the need for this: the Virtual Component has connections to IOBs on the left side. A direct connection would imply that the whole net will be reconfigured. But the route crosses the static part, which would then also be affected by the reconfiguration. If a passthrough CLB is inserted, it can be placed close to the Virtual Component, keeping the dynamic part of the net short.

We have implemented two different connection types with passthrough CLBs: a single stage CLB macro and a double stage CLB macro. We have inserted the single stage CLB macros with the Dynamic Routing Flow and the double stage CLB macros with the Direct Copy Flow.

Single Stage CLB Macro


A single stage CLB macro is actually the same as a feedthrough macro. It consists of only two slices and can feed four routes through it. We have used the FPGA Editor to create a macro which we could instantiate in our VHDL architecture. The following steps describe the procedure:

1. Open the FPGA Editor and choose File->New. Select Macro and enter a file name (e.g. nvc2.nmc). Select a part with the same architecture (size and package are unimportant).

2. Zoom into the Array Window and select the left slice in an arbitrary CLB. Choose Edit->Add to add a slice component. Do the same with the right slice.

3. Double-click the left slice to open the Block Window. Click the Begin Editing icon and edit the slice by clicking on the desired resources. Lookup table functions can be added by clicking the F= icon. When finished, don't forget to save the changes. Figure 5-12 shows the final slice. Edit the right slice in the same manner.

4. We now have to add macro pins to all the inputs and outputs. In the Array Window click on a pin and choose Edit->Add Macro External Pin. Give the external pin a meaningful name (without a preceding $ as in the default name). We named the inputs and outputs from and to the interface iin<0>, iin<1>, iout<0> and iout<1>. The pins to the Virtual Component bear the names vin<0>, vin<1>, vout<0> and vout<1>.

5. Save the macro.


6. Instantiate the macro in the VHDL code with

component nvc2
  port (
    vin  : in  std_logic_vector(1 downto 0);
    vout : out std_logic_vector(1 downto 0);
    iin  : in  std_logic_vector(1 downto 0);
    iout : out std_logic_vector(1 downto 0));
end component;

7. For the implementation, place the .nmc file in the same directory as the synthesis netlist of the design (.edf).

Figure 5-12: Block Window of the FPGA Editor with a passthrough CLB slice.

The single stage CLB macros can be placed with normal location constraints in the UCF file.

Double Stage CLB Macros


The idea of the double stage CLB macros is to achieve a hard connection between the Virtual Component and the interface (see figure 2-6). Hard means that the inserted macro has prerouted nets (hard routed macro) which remain unaltered in the design. Normally a macro has only soft routes, which are routed together with the whole design. The advantage of using hard routed macros as connection between the Virtual Component and the interface is that the JBits manipulation program doesn't have to use the routing function. The connection is guaranteed because all modules use the same macros and therefore the same routing resources. If an extracted module is pasted over an existing one, the new module uses the same connections as the old module. A double stage macro consists of four passthrough slices and four hard wired routes (see figure 5-13). A signal from the interface to the Virtual Component, for



Figure 5-13: Double Stage CLB Macro (top CLB: iin<0>, iin<1>, iout<0>, iout<1>; bottom CLB: vin<0>, vin<1>, vout<0>, vout<1>).

example, goes from iin over the top CLB passthrough slice and the hard wired route to the bottom CLB passthrough slice and on to vout. The procedure to create the double stage CLB macro is very similar to the one for the single stage CLB macro. New is the creation of hard wired routes. In the FPGA Editor this is done with the following steps:

1. Select the source pin of the route.
2. Press the Shift key and hold it while selecting the sink of the route.
3. Choose Edit->Add.
4. If Automatic Routing was enabled (in Main Properties) the net is already routed. If not, select the unrouted net and choose Tools->Route->Auto Route.

Because the routes are hard, the placement of the double stage CLB macros must be done with caution. An error will occur if two misplaced macros use the same routing resource. In our case, we saw that no more than six macros can be placed one upon the other. For the datain and dataout buses for example (see section 5.3.1) we need 16 macros. We have arranged them in a 4 x 4 array.

5.5

Bitstream Manipulation with JBits

5.5.1

Introduction

The JBits SDK is an Application Program Interface to the Xilinx configuration bitstream. This API permits Java applications to dynamically modify Xilinx Virtex bitstreams. Figure 5-14 shows the design flow for partial reconfiguration. The whole idea behind partial reconfiguration is to make only those changes to a device that bring it into a desired configuration. The partial reconfiguration model performs this function by determining the changes made between the last configuration sent to

the device and the present configuration in memory. Then it must create a sequence of packets that will partially reconfigure the device. After all that, the model will mark the device and memory as synchronized and the process will start over again.
Figure 5-14: JBits Design Flow for partial Reconfiguration (Virtex Bitstream from the Xilinx tools -> Java Application with the JBits SDK -> partial Bitstream -> Virtex Hardware).

A brief introduction to JBits is given in the JBits Tutorial [14].

5.5.2

Function Blocks

In this section, we explain the functions we have used in our JBits program for the Dynamic Routing Flow and the Direct Copy Flow.

Reading and Writing Bitstreams


To read a bitstream you have to execute the following commands:

jbits.read(<BitstreamFile>);
jbits.enableCrc(true);
jbits.clearDirtyFrames();

The last command tells JBits that the FPGA contains the same configuration and is therefore synchronized. To write all the changes since the last clearDirtyFrames() to a partial bitstream, you can use:

jbits.writePartial(<partialFile>);

A full configuration can be written with:

jbits.write(<BitstreamOutFile>);

Using JRoute
JRoute uses a database called the Resource Factory. It stores the information whether a routing resource is used by a route or whether it is available for new routes. This database has to be filled explicitly when you read in a new bitstream. This is done with:

ResourceFactory rf = ResourceFactory.getResourceFactory(jbits);
rf.fillResourceFactory();

Only JRoute uses and updates this database. If a connection is made with the set or the makeConnection command, the routing resource has to be marked as used in the database.

Routing Table
We have used the object NetPinsList for our routing table. As the name says, it is a list of NetPins. These two classes are both from the RTP package. Although we don't have RTP cores, we use these helpful constructs as follows:

NetPinsList rtable = new NetPinsList();
NetPins net = new NetPins(null);

To add a source to a net:

net.addSource(<Pin>);

To add a sink to a net (the net can have multiple sinks):

net.addSink(<Pin>);

To add the net to the routing table:

rtable.add(net);

To add a complete net to the routing table, we use the trace function of JBits. If rtree is the traced Route Tree of a source, we can add the complete net with:

net = new NetPins(null);
net.addSource(rtree.getPin());
for (int k = 0; k < rtree.getBottom().length; k++) {
    net.addSink(rtree.getBottom()[k].getPin());
}
rtable.add(net);

CopyPaste a Module
The extraction of a module from a bitstream is done with a special function which reads all available resources of a source CLB and writes this configuration to a target CLB. The extraction function was initially written by Phil James-Roxby, one of the inventors of JBits. Our co-tutor Herbert Walder added the set function to this class. The set function does not reserve the routing resources in the Resource Factory. If JRoute is used after this function, the database has to be updated with the fillResourceFactory() method.

Clock Distribution
The Xilinx tools activate only those connections of the clock net to the CLBs which are needed. The clock routing will not be copied by the above function, therefore we have to do our own clock distribution. In the com.xilinx.tools package, there are functions which connect all CLBs within an area, or the entire chip, to a specific clock net.


Finding LUT inputs


We saw that the input pins of the Single Stage CLB Macro are not the same for all macros (e.g. F1 or G1). The reason for this is that the place&route function of the Xilinx tools rearranges the LUT inputs if a better routing can be achieved. To find the single input of a LUT in the Single Stage CLB Macro, we test the input muxes of the LUT. For example, to test whether the F1 input of slice 1 is used, we do:

int[] S1_F1 = jbits.get(<row>, <col>, S1F1.S1F1);
if (!Util.Compare(S1_F1, S1F1.OFF)) {
    // used = true
}

Unused Connections
The unused connections are tied to ground. For the Dynamic Routing Flow we saw that an additional slice with a zero lookup table outputs a ground net, which is connected to the macro input. But this additional slice is not in the reconfigurable area, but right beside the macro. These connections need a special treatment, because they will not be copied by the copy-paste function. We can detect such connections with the route tracer: if a route from a macro does not end in the reconfigurable area, it must be a ground net. Instead of copying this net, we first remember which macro has this unused input. Then we directly set the corresponding lookup table in the target bitstream to zero. The sequence to set the F LUT of slice 0 to zero is:

int[] nullLut = Util.IntToIntArray(0, 16); // create 0 LUT output
nullLut = Util.InvertIntArray(nullLut);    // has to be inverted
jbits.set(<row>, <col>, LUT.SLICE0_F, nullLut);

5.6

Partial Reconfiguration

The partial reconfiguration is done with a special PC I/O card. We had to reprogram the CPLD on the XSV800 to connect the parallel port to the SelectMap interface of the FPGA, since partial reconfiguration is not supported in Slave Serial Mode [13]. With a user program which runs on the configuration PC, we can download full and partial bitstreams to the device.

5.7

Implementation Results

5.7.1

Dynamic Routing Flow

As a first approach we only downloaded the full configuration bitstreams produced with JBits. With this we wanted to show the functionality of our JBits program. The result of the Dynamic Routing Flow was not overwhelming. We could download the full bitstream with the ADPCM Player replaced by the PCM Player, but it did not sound clean. Though the module played our PCM music, there was a loud noise interference. We supposed that the datain bus was only partially connected, so we


went back to the JBits program and did some signal tracing. But there, we could not find any inconsistency. The opposite direction (from the PCM Player to the ADPCM Player) was even worse: we couldn't even reconnect all the signals of the replaced module. Without apparent cause, the routing function claimed that a certain input pin was already in use and that it therefore could not route our net. But we had done a previous reverseUnroute call for this pin, and the tracing function did not detect any routing resource on this pin either. For these reasons we preferred the Direct Copy Flow, which does not use JRoute at all.

5.7.2

Direct Copy Flow

For the Direct Copy Flow we had no problems building the bitstreams, since the JBits program for this flow is much simpler. We again started with testing the full bitstreams. The bitstream with the ADPCM Player replaced by the PCM Player was almost perfect: we could not hear any noise anymore, though sometimes there was still a cracking on the loudspeakers. The bitstream with the PCM Player replaced by the ADPCM Player was not working correctly. When LEON sends the audio samples to the FIFO, it seems that they immediately disappear. The FIFO never got filled, even if we didn't tell the module to start playing. So far, we haven't found the bug; that's why we concentrated on the other, working bitstream.

To replace the ADPCM Player with the PCM Player we have a partial bitstream which should apply only the changes needed, leaving the other part intact. We wrote the full bitstream with the ADPCM Player to the device, loaded the software and verified it by playing ADPCM music. Then we sent the partial bitstream to the device. We saw that the reconfiguration was much faster than the full configuration. The result was the following behavior: the application on LEON did not crash, which was good news. But if we then started playing something (PCM or ADPCM) we heard only some sort of a sawtooth sound. With this we showed that it did reconfigure something, but not the right way. We could not actually solve this puzzle. But from the JBits mailing list we heard that there are differences between version 2.8, which we used, and the older version 2.7 concerning the generation of partial bitstreams. We then ran our program with version 2.7 and the big surprise was that it worked! The reconfiguration successfully replaced the ADPCM Player with the PCM Player. Although only in one direction, we had a proof of concept for our methods.

5.7.3 Network

The network connection for our system (cf. chapter 4) was developed in parallel with the reconfigurable part described in this chapter. We targeted a final system offering both the network connection and the reconfiguration ability, but did not manage to integrate the network into the reconfigurable flow within the time limit of our thesis. We have two versions now: one version with network connection and streaming ability, but not reconfigurable, and one version without network connection, but reconfigurable. For the reconfigurable version we had to preload the audio data into the SRAM on the development board. To integrate the network into the reconfigurable system, we will have to apply the constraints of section 5.4. Since the network uses eight Blockrams, which will inevitably be on the right side of the FPGA, floorplanning and guided routing will be quite demanding to avoid disturbing lines.

5.7.4 Design Facts

LEON:
Size: 3865 slices (41.1 % CLB usage with Virtex XCV800)
Blockrams: 14 (50 % BRAM usage)

Virtual Components:
PCMPlayer: Size: 35 slices (0.4 %)
ADPCMPlayer: Size: 430 slices (4.5 %)
Interface: Size: 105 slices (1.1 %), Blockrams: 4 (14.3 %)

Hardware/Software Versions used

Xilinx Foundation Series 3.1i (3.3sp7) / 4.1i
FPGA Express Xilinx Edition 3.6.0
LEON1-2.3.7
RTEMS-4.5.0-jg-2
JBits 2.7 / JBits 2.8



Conclusions
We have built a dynamically reconfigurable system on an FPGA. The system consists of a static part with the LEON 32-bit CPU IP core and a dynamic part with a reconfigurable unit. To implement the LEON CPU on the Virtex FPGA of the XSV800 development board, we successfully adapted the VHDL configuration to our target architecture. We also enabled important features of the development platform and attached them to the CPU. As an operating system, RTEMS proved to be a good choice; it allowed us to write user applications in a convenient manner. We have written a network device driver which connects our Ethernet interface to the operating system's network stack. The network interface we implemented on the FPGA gives the processor a faster connection to an attached computer. An application on LEON can successfully establish a UDP connection and send and receive data over it. Although the network card is not yet capable of handling every possible error condition, in a simple environment with no collisions it works flawlessly. On top of this development environment, we have built a demonstrator of a dynamically reconfigurable system. We implemented two dynamically reconfigurable units, both audio codecs. The objective was to selectively stream PCM or ADPCM audio data from a PC over Ethernet to our FPGA, where, depending on the incoming data, either the PCM or the ADPCM codec would play the data stream. For the ADPCM module we developed our own ADPCM decoder unit. We could successfully test the static system (without reconfiguration): both modules played the appropriate audio stream clearly, and the network connection and the application on LEON also worked as expected. The next step was to dynamically replace the modules on the FPGA. For this we worked out a flow which allows us to design the entire system with mainstream synthesis tools and then to use the bitstream manipulation tool JBits for the dynamic replacement.
To enable this flow, we had to constrain the designs with floorplanning and the insertion of hard macros. We finally managed to dynamically replace one audio codec with the other and could thereby demonstrate that our concept works. Due to the time limit of our thesis, our system is not yet fully elaborated. For example, we did not manage to integrate the network interface into the dynamic flow. The dynamic reconfiguration itself could only be demonstrated in one direction (from ADPCM to PCM); thus, we cannot yet consecutively replace the two modules as targeted in our vision. Since our flow depends heavily on JBits, which is still under development, we attribute some of the problems to bugs in this software. In spite of these problems, we believe we have attained our goals. On the one hand, we have built a versatile development platform containing the LEON CPU with various interfaces; on the other hand, we could demonstrate a dynamically reconfigurable system.


Future Work
As mentioned before, our system is not yet fully elaborated. In this section we give some ideas for improving the current version.

LEON CPU
Speed: The current version runs at 20 MHz. With dedicated floorplanning and minor changes in the VHDL code, it should be possible to achieve a noticeable speed-up for the CPU.

New Versions: LEON is continuously improved by Gaisler Research [3]. We utilize LEON1-2.3.7; at the end of our thesis, LEON2-1.0.2 was released. New versions could bring valuable improvements for our system, e.g. a DMA controller.

BootLoader: Our version currently boots a simple monitor program from Blockrams or distributed memory on the FPGA. We then have to download the user application over the slow serial interface. A new boot concept would be very helpful; for example, the use of the on-board FLASH PROM could be evaluated. It would also be possible to download the user application over Ethernet into memory.

Network Interface
Collisions: The current version does not handle collisions; collided frames are still partially received. A collision detection should therefore be implemented.

Destination Address Checking: Currently the receiver does not check the destination address and accepts every frame.

Frame Buffer: Instead of Blockram FIFOs, a more suitable buffer for the incoming and outgoing Ethernet frames should be evaluated. It would be helpful if more than one frame at a time could be stored.

Interface to the CPU: It could be advantageous to attach the network interface as an AMBA AHB device to the CPU, to allow high-speed data transfers, possibly with a DMA controller.
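The first two receive-side checks can be sketched in software. The following is a minimal model, not our VHDL implementation: the function names and frame handling are ours; only the CRC-32 algorithm is the standard IEEE 802.3 one (presumably what the crcgenerator module computes in hardware).

```python
def crc32_fcs(data: bytes) -> int:
    """Reflected CRC-32 (polynomial 0xEDB88320, init and final XOR
    0xFFFFFFFF), as used for the Ethernet frame check sequence."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

BROADCAST = b"\xff" * 6

def accept_frame(frame: bytes, my_mac: bytes) -> bool:
    """Accept a frame only if it is addressed to us (or broadcast) and the
    trailing 4-byte FCS matches the CRC-32 over the rest of the frame."""
    if len(frame) < 18:                  # 14-byte header + 4-byte FCS minimum
        return False
    dst = frame[0:6]
    if dst not in (my_mac, BROADCAST):
        return False
    fcs = int.from_bytes(frame[-4:], "little")
    return crc32_fcs(frame[:-4]) == fcs
```

In hardware the same two checks would simply gate the write-enable of the receive FIFO, so that misaddressed or corrupted frames never reach LEON.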

Virtual Components
Decoder State for ADPCM: If a chunk of the ADPCM stream gets lost, the decoder loses its state, resulting in strange effects like increasing or decreasing volume. If the decoder state could be set explicitly, this state could be sent at the beginning of every Ethernet frame.

PCM Stereo Player: The PCMPlayer is still mono and could be enhanced to stereo.

Other Virtual Components: Our interface allows virtual components other than audio codecs; for example, different dynamic accelerators could be implemented.
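The decoder-state idea above could look as follows. This is a hypothetical sketch: the 3-byte header format is our own invention; only the state itself, a 16-bit predicted sample plus a step-table index in 0..88, is dictated by the IMA/DVI ADPCM algorithm [11].

```python
import struct

def pack_state(predictor: int, index: int) -> bytes:
    """Serialize IMA/DVI ADPCM decoder state into a hypothetical 3-byte
    frame header: 16-bit signed predicted sample, then step-table index."""
    assert -32768 <= predictor <= 32767 and 0 <= index <= 88
    return struct.pack(">hB", predictor, index)

def unpack_state(header: bytes):
    """Restore (predictor, index) from the first three bytes of a frame."""
    predictor, index = struct.unpack(">hB", header[:3])
    return predictor, index

# Round trip: the sender prepends the state, the receiver restores it,
# so a lost frame no longer desynchronizes the decoder permanently.
assert unpack_state(pack_state(-1234, 42)) == (-1234, 42)
```

With such a header, the hardware decoder would load predictor and index registers at every frame start instead of carrying them over across frame boundaries.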

Constraining
AntiCores: To avoid disturbing lines, we have not yet tried to utilize anti-cores. This method should also be taken into account.

Other Development Board: The main reason for disturbing lines is the development board we use. If we could freely assign the IOBs of the FPGA, we could prevent most of the disturbing lines.

Dynamic Reconfiguration
Debugging: We still do not know why the dynamic reconfiguration works with one design and not with the other. There is a difference in bitstream generation between JBits versions 2.7 and 2.8; knowledge of this difference and of other bugs may be the missing key to a fully working reconfiguration.

Alternatives to JBits: For the Direct Copy Flow, it might be possible to replace the modules with tools other than JBits.


A
LEON VHDL files
The order in which the files have to be added in Synopsys is as follows:
add_file -library WORK -format VHDL ../leon/amba.vhd
add_file -library WORK -format VHDL ../leon/target.vhd
add_file -library WORK -format VHDL ../leon/device.vhd
add_file -library WORK -format VHDL ../leon/config.vhd
add_file -library WORK -format VHDL ../leon/sparcv8.vhd
add_file -library WORK -format VHDL ../leon/iface.vhd
add_file -library WORK -format VHDL ../leon/macro.vhd
add_file -library WORK -format VHDL ../leon/bprom.vhd
add_file -library WORK -format VHDL ../leon/multlib.vhd
add_file -library WORK -format VHDL ../leon/tech_generic.vhd
add_file -library WORK -format VHDL ../leon/tech_virtex.vhd
add_file -library WORK -format VHDL ../leon/tech_atc25.vhd
add_file -library WORK -format VHDL ../leon/tech_atc35.vhd
add_file -library WORK -format VHDL ../leon/tech_fs90.vhd
add_file -library WORK -format VHDL ../leon/tech_umc18.vhd
add_file -library WORK -format VHDL ../leon/tech_map.vhd
add_file -library WORK -format VHDL ../leon/cachemem.vhd
add_file -library WORK -format VHDL ../leon/icache.vhd
add_file -library WORK -format VHDL ../leon/dcache.vhd
add_file -library WORK -format VHDL ../leon/acache.vhd
add_file -library WORK -format VHDL ../leon/cache.vhd
add_file -library WORK -format VHDL ../leon/apbmst.vhd
add_file -library WORK -format VHDL ../leon/ahbstat.vhd
add_file -library WORK -format VHDL ../leon/ahbtest.vhd
add_file -library WORK -format VHDL ../leon/ambacomp.vhd
add_file -library WORK -format VHDL ../leon/ahbarb.vhd
add_file -library WORK -format VHDL ../leon/lconf.vhd
add_file -library WORK -format VHDL ../leon/fpulib.vhd
add_file -library WORK -format VHDL ../leon/fp1eu.vhd
add_file -library WORK -format VHDL ../leon/ioport.vhd
add_file -library WORK -format VHDL ../leon/irqctrl.vhd
add_file -library WORK -format VHDL ../leon/clkgen.vhd


add_file -library WORK -format VHDL ../leon/mctrl.vhd
add_file -library WORK -format VHDL ../leon/rstgen.vhd
add_file -library WORK -format VHDL ../leon/timers.vhd
add_file -library WORK -format VHDL ../leon/uart.vhd
add_file -library WORK -format VHDL ../leon/div.vhd
add_file -library WORK -format VHDL ../leon/mul.vhd
add_file -library WORK -format VHDL ../leon/iu.vhd
add_file -library WORK -format VHDL ../leon/proc.vhd
add_file -library WORK -format VHDL ../leon/wprot.vhd
add_file -library WORK -format VHDL ../leon/mcore.vhd
add_file -library WORK -format VHDL ../leon/leon.vhd

Next are the files of the network interface:

add_file -library WORK -format VHDL ../leon/crcgenerator.vhdl
add_file -library WORK -format VHDL ../leon/ether_recv.vhdl
add_file -library WORK -format VHDL ../leon/ether_send.vhdl
add_file -library WORK -format VHDL ../leon/fifo.vhdl

For the audio codec, it is either the plain PCM codec:

add_file -library WORK -format VHDL ../leon/adpcm/vcaudio.vhd

or the ADPCM codec:

add_file -library WORK -format VHDL ../leon/adpcm/amux.vhd
add_file -library WORK -format VHDL ../leon/adpcm/control.vhd
add_file -library WORK -format VHDL ../leon/adpcm/deltareg.vhd
add_file -library WORK -format VHDL ../leon/adpcm/indexsat.vhd
add_file -library WORK -format VHDL ../leon/adpcm/indextable.vhd
add_file -library WORK -format VHDL ../leon/adpcm/mux.vhd
add_file -library WORK -format VHDL ../leon/adpcm/reg16.vhd
add_file -library WORK -format VHDL ../leon/adpcm/reg7.vhd
add_file -library WORK -format VHDL ../leon/adpcm/stepreg.vhd
add_file -library WORK -format VHDL ../leon/adpcm/steptable.vhd
add_file -library WORK -format VHDL ../leon/adpcm/predreg.vhd
add_file -library WORK -format VHDL ../leon/adpcm/decoder.vhd
add_file -library WORK -format VHDL ../leon/adpcm/vcadpcm.vhd

And finally the top designs:

add_file -library WORK -format VHDL ../leon/top.vhdl
add_file -library WORK -format VHDL ../leon/xsv800.vhd


B
UCF Constraint File
# Timing constraints: (clock to 20 or 25 MHz)
NET "clk" TNM_NET = "clk";
TIMESPEC "TS_clk" = PERIOD "clk" 25 MHz HIGH 50 %;

# IOB Locations ...

# Manual Placed TBUFs (ext RAM data -> databus)
INST "TRIBUF0_31" LOC = "TBUF_R16C40.1" ;
INST "TRIBUF0_30" LOC = "TBUF_R16C40.0" ;
INST "TRIBUF0_29" LOC = "TBUF_R15C40.1" ;
INST "TRIBUF0_28" LOC = "TBUF_R15C40.0" ;
INST "TRIBUF0_27" LOC = "TBUF_R14C40.1" ;
INST "TRIBUF0_26" LOC = "TBUF_R14C40.0" ;
INST "TRIBUF0_25" LOC = "TBUF_R13C40.1" ;
INST "TRIBUF0_24" LOC = "TBUF_R13C40.0" ;
INST "TRIBUF0_23" LOC = "TBUF_R12C40.1" ;
INST "TRIBUF0_22" LOC = "TBUF_R12C40.0" ;
INST "TRIBUF0_21" LOC = "TBUF_R11C40.1" ;
INST "TRIBUF0_20" LOC = "TBUF_R11C40.0" ;
INST "TRIBUF0_19" LOC = "TBUF_R10C40.1" ;
INST "TRIBUF0_18" LOC = "TBUF_R10C40.0" ;
INST "TRIBUF0_17" LOC = "TBUF_R9C40.1" ;
INST "TRIBUF0_16" LOC = "TBUF_R9C40.0" ;
INST "TRIBUF0_15" LOC = "TBUF_R8C40.1" ;
INST "TRIBUF0_14" LOC = "TBUF_R8C40.0" ;
INST "TRIBUF0_13" LOC = "TBUF_R7C40.1" ;
INST "TRIBUF0_12" LOC = "TBUF_R7C40.0" ;
INST "TRIBUF0_11" LOC = "TBUF_R6C40.1" ;
INST "TRIBUF0_10" LOC = "TBUF_R6C40.0" ;
INST "TRIBUF0_9" LOC = "TBUF_R5C40.1" ;
INST "TRIBUF0_8" LOC = "TBUF_R5C40.0" ;
INST "TRIBUF0_7" LOC = "TBUF_R4C40.1" ;
INST "TRIBUF0_6" LOC = "TBUF_R4C40.0" ;
INST "TRIBUF0_5" LOC = "TBUF_R3C40.1" ;
INST "TRIBUF0_4" LOC = "TBUF_R3C40.0" ;
INST "TRIBUF0_3" LOC = "TBUF_R2C40.1" ;
INST "TRIBUF0_2" LOC = "TBUF_R2C40.0" ;

73

INST "TRIBUF0_1" LOC = "TBUF_R1C40.1" ;
INST "TRIBUF0_0" LOC = "TBUF_R1C40.0" ;

# Manual Placed TBUFs (VC Status / FIFO Handshake -> databus)
INST "bififo0/TRIBUF2_31" LOC = "TBUF_R16C45.1" ;
INST "bififo0/TRIBUF2_30" LOC = "TBUF_R16C45.0" ;
INST "bififo0/TRIBUF2_29" LOC = "TBUF_R15C45.1" ;
INST "bififo0/TRIBUF2_28" LOC = "TBUF_R15C45.0" ;
INST "bififo0/TRIBUF2_27" LOC = "TBUF_R14C45.1" ;
INST "bififo0/TRIBUF2_26" LOC = "TBUF_R14C45.0" ;
INST "bififo0/TRIBUF2_25" LOC = "TBUF_R13C45.1" ;
INST "bififo0/TRIBUF2_24" LOC = "TBUF_R13C45.0" ;
INST "bififo0/TRIBUF2_23" LOC = "TBUF_R12C45.1" ;
INST "bififo0/TRIBUF2_22" LOC = "TBUF_R12C45.0" ;
INST "bififo0/TRIBUF2_21" LOC = "TBUF_R11C45.1" ;
INST "bififo0/TRIBUF2_20" LOC = "TBUF_R11C45.0" ;
INST "bififo0/TRIBUF2_19" LOC = "TBUF_R10C45.1" ;
INST "bififo0/TRIBUF2_18" LOC = "TBUF_R10C45.0" ;
INST "bififo0/TRIBUF2_17" LOC = "TBUF_R9C45.1" ;
INST "bififo0/TRIBUF2_16" LOC = "TBUF_R9C45.0" ;
INST "bififo0/TRIBUF2_15" LOC = "TBUF_R8C45.1" ;
INST "bififo0/TRIBUF2_14" LOC = "TBUF_R8C45.0" ;
INST "bififo0/TRIBUF2_13" LOC = "TBUF_R7C45.1" ;
INST "bififo0/TRIBUF2_12" LOC = "TBUF_R7C45.0" ;
INST "bififo0/TRIBUF2_11" LOC = "TBUF_R6C45.1" ;
INST "bififo0/TRIBUF2_10" LOC = "TBUF_R6C45.0" ;
INST "bififo0/TRIBUF2_9" LOC = "TBUF_R5C45.1" ;
INST "bififo0/TRIBUF2_8" LOC = "TBUF_R5C45.0" ;
INST "bififo0/TRIBUF2_7" LOC = "TBUF_R4C45.1" ;
INST "bififo0/TRIBUF2_6" LOC = "TBUF_R4C45.0" ;
INST "bififo0/TRIBUF2_5" LOC = "TBUF_R3C45.1" ;
INST "bififo0/TRIBUF2_4" LOC = "TBUF_R3C45.0" ;
INST "bififo0/TRIBUF2_3" LOC = "TBUF_R2C45.1" ;
INST "bififo0/TRIBUF2_2" LOC = "TBUF_R2C45.0" ;
INST "bififo0/TRIBUF2_1" LOC = "TBUF_R1C45.1" ;
INST "bififo0/TRIBUF2_0" LOC = "TBUF_R1C45.0" ;

# Manual Placed TBUFs (FIFO b -> databus)
INST "bififo0/TRIBUF1_31" LOC = "TBUF_R16C46.1" ;
INST "bififo0/TRIBUF1_30" LOC = "TBUF_R16C46.0" ;
INST "bififo0/TRIBUF1_29" LOC = "TBUF_R15C46.1" ;
INST "bififo0/TRIBUF1_28" LOC = "TBUF_R15C46.0" ;
INST "bififo0/TRIBUF1_27" LOC = "TBUF_R14C46.1" ;
INST "bififo0/TRIBUF1_26" LOC = "TBUF_R14C46.0" ;
INST "bififo0/TRIBUF1_25" LOC = "TBUF_R13C46.1" ;
INST "bififo0/TRIBUF1_24" LOC = "TBUF_R13C46.0" ;
INST "bififo0/TRIBUF1_23" LOC = "TBUF_R12C46.1" ;
INST "bififo0/TRIBUF1_22" LOC = "TBUF_R12C46.0" ;
INST "bififo0/TRIBUF1_21" LOC = "TBUF_R11C46.1" ;
INST "bififo0/TRIBUF1_20" LOC = "TBUF_R11C46.0" ;
INST "bififo0/TRIBUF1_19" LOC = "TBUF_R10C46.1" ;
INST "bififo0/TRIBUF1_18" LOC = "TBUF_R10C46.0" ;
INST "bififo0/TRIBUF1_17" LOC = "TBUF_R9C46.1" ;
INST "bififo0/TRIBUF1_16" LOC = "TBUF_R9C46.0" ;
INST "bififo0/TRIBUF1_15" LOC = "TBUF_R8C46.1" ;
INST "bififo0/TRIBUF1_14" LOC = "TBUF_R8C46.0" ;
INST "bififo0/TRIBUF1_13" LOC = "TBUF_R7C46.1" ;
INST "bififo0/TRIBUF1_12" LOC = "TBUF_R7C46.0" ;
INST "bififo0/TRIBUF1_11" LOC = "TBUF_R6C46.1" ;
INST "bififo0/TRIBUF1_10" LOC = "TBUF_R6C46.0" ;
INST "bififo0/TRIBUF1_9" LOC = "TBUF_R5C46.1" ;
INST "bififo0/TRIBUF1_8" LOC = "TBUF_R5C46.0" ;
INST "bififo0/TRIBUF1_7" LOC = "TBUF_R4C46.1" ;
INST "bififo0/TRIBUF1_6" LOC = "TBUF_R4C46.0" ;

74

INST "bififo0/TRIBUF1_5" LOC = "TBUF_R3C46.1" ;
INST "bififo0/TRIBUF1_4" LOC = "TBUF_R3C46.0" ;
INST "bififo0/TRIBUF1_3" LOC = "TBUF_R2C46.1" ;
INST "bififo0/TRIBUF1_2" LOC = "TBUF_R2C46.0" ;
INST "bififo0/TRIBUF1_1" LOC = "TBUF_R1C46.1" ;
INST "bififo0/TRIBUF1_0" LOC = "TBUF_R1C46.0" ;

# Manual Placed TBUFs (databus -> FIFO a)
INST "bififo0/TRIBUF3_31" LOC = "TBUF_R16C44.1" ;
INST "bififo0/TRIBUF3_30" LOC = "TBUF_R16C44.0" ;
INST "bififo0/TRIBUF3_29" LOC = "TBUF_R15C44.1" ;
INST "bififo0/TRIBUF3_28" LOC = "TBUF_R15C44.0" ;
INST "bififo0/TRIBUF3_27" LOC = "TBUF_R14C44.1" ;
INST "bififo0/TRIBUF3_26" LOC = "TBUF_R14C44.0" ;
INST "bififo0/TRIBUF3_25" LOC = "TBUF_R13C44.1" ;
INST "bififo0/TRIBUF3_24" LOC = "TBUF_R13C44.0" ;
INST "bififo0/TRIBUF3_23" LOC = "TBUF_R12C44.1" ;
INST "bififo0/TRIBUF3_22" LOC = "TBUF_R12C44.0" ;
INST "bififo0/TRIBUF3_21" LOC = "TBUF_R11C44.1" ;
INST "bififo0/TRIBUF3_20" LOC = "TBUF_R11C44.0" ;
INST "bififo0/TRIBUF3_19" LOC = "TBUF_R10C44.1" ;
INST "bififo0/TRIBUF3_18" LOC = "TBUF_R10C44.0" ;
INST "bififo0/TRIBUF3_17" LOC = "TBUF_R9C44.1" ;
INST "bififo0/TRIBUF3_16" LOC = "TBUF_R9C44.0" ;
INST "bififo0/TRIBUF3_15" LOC = "TBUF_R8C44.1" ;
INST "bififo0/TRIBUF3_14" LOC = "TBUF_R8C44.0" ;
INST "bififo0/TRIBUF3_13" LOC = "TBUF_R7C44.1" ;
INST "bififo0/TRIBUF3_12" LOC = "TBUF_R7C44.0" ;
INST "bififo0/TRIBUF3_11" LOC = "TBUF_R6C44.1" ;
INST "bififo0/TRIBUF3_10" LOC = "TBUF_R6C44.0" ;
INST "bififo0/TRIBUF3_9" LOC = "TBUF_R5C44.1" ;
INST "bififo0/TRIBUF3_8" LOC = "TBUF_R5C44.0" ;
INST "bififo0/TRIBUF3_7" LOC = "TBUF_R4C44.1" ;
INST "bififo0/TRIBUF3_6" LOC = "TBUF_R4C44.0" ;
INST "bififo0/TRIBUF3_5" LOC = "TBUF_R3C44.1" ;
INST "bififo0/TRIBUF3_4" LOC = "TBUF_R3C44.0" ;
INST "bififo0/TRIBUF3_3" LOC = "TBUF_R2C44.1" ;
INST "bififo0/TRIBUF3_2" LOC = "TBUF_R2C44.0" ;
INST "bififo0/TRIBUF3_1" LOC = "TBUF_R1C44.1" ;
INST "bififo0/TRIBUF3_0" LOC = "TBUF_R1C44.0" ;

# Manual Placed BlockRAMs (FIFO)
INST "bififo0/fifob/B7" LOC = "RAMB4_R2C1" ;
INST "bififo0/fifob/B11" LOC = "RAMB4_R3C1" ;
INST "bififo0/fifoa/B7" LOC = "RAMB4_R0C1" ;
INST "bififo0/fifoa/B11" LOC = "RAMB4_R1C1" ;

# FIFO Area Constraints
AREA_GROUP "AG_bififo0" RANGE = CLB_R1C43:CLB_R16C50 ;
AREA_GROUP "AG_bififo0" RANGE = TBUF_R1C43:TBUF_R16C50 ;
INST bififo0 AREA_GROUP = AG_bififo0 ;
AREA_GROUP "AG_bififo0/fifoa" RANGE = CLB_R1C45:CLB_R8C50 ;
INST bififo0/fifoa AREA_GROUP = AG_bififo0/fifoa ;
AREA_GROUP "AG_bififo0/fifob" RANGE = CLB_R9C45:CLB_R16C50 ;
INST bififo0/fifob AREA_GROUP = AG_bififo0/fifob ;

# LEON Area Constraints
AREA_GROUP "AG_leon0" RANGE = CLB_R1C1:CLB_R56C40 ;
AREA_GROUP "AG_leon0" RANGE = TBUF_R1C1:TBUF_R56C40 ;


AREA_GROUP "AG_leon0" RANGE = RAMB4_R0C0:RAMB4_R13C0 ;
INST leon0 AREA_GROUP = AG_leon0 ;

# Virtual Component Area Constraints
AREA_GROUP "AG_vcaudio" RANGE = CLB_R24C55:CLB_R40C78 ;
INST "vctop0/vcaudio_1" AREA_GROUP = AG_vcaudio ;

# Guided Routing (get rid of disturbing lines)
INST "FTRAM2ADR_1" LOC = CLB_R1C41.*:CLB_R15C45.*;
INST "FTRAM2ADR_2" LOC = CLB_R1C41.*:CLB_R15C45.*;
INST "FTRAM2ADR_3" LOC = CLB_R1C41.*:CLB_R15C45.*;
INST "FTBAR_2" LOC = CLB_R1C41.*:CLB_R15C45.*;
INST "FTRAM1ADR_1" LOC = CLB_R51C41.*:CLB_R56C43.*;
INST "FTRAM1ADR_2" LOC = CLB_R51C41.*:CLB_R56C43.*;
INST "FTRAM1ADR_3" LOC = CLB_R51C41.*:CLB_R56C43.*;
INST "FTBAR_1" LOC = CLB_R51C41.*:CLB_R56C43.*;
INST "FTSER" LOC = CLB_R51C41.*:CLB_R56C43.*;

INST "TBUFRXD_1" LOC = TBUF_R52C82:TBUF_R56C84;
#INST "TBUFRXD_2" LOC = TBUF_R52C82:TBUF_R56C84;

# Manual Placement of Double Stage CLB Macros
# VC Data In / VC Data Out
INST "vctop0/VCIDATA_0" LOC = CLB_R6C78.*:CLB_R24C78.*;
INST "vctop0/VCIDATA_1" LOC = CLB_R6C77.*:CLB_R24C77.*;
INST "vctop0/VCIDATA_2" LOC = CLB_R6C76.*:CLB_R24C76.*;
INST "vctop0/VCIDATA_3" LOC = CLB_R6C75.*:CLB_R24C75.*;
INST "vctop0/VCIDATA_4" LOC = CLB_R7C78.*:CLB_R25C78.*;
INST "vctop0/VCIDATA_5" LOC = CLB_R7C77.*:CLB_R25C77.*;
INST "vctop0/VCIDATA_6" LOC = CLB_R7C76.*:CLB_R25C76.*;
INST "vctop0/VCIDATA_7" LOC = CLB_R7C75.*:CLB_R25C75.*;
INST "vctop0/VCIDATA_8" LOC = CLB_R8C78.*:CLB_R26C78.*;
INST "vctop0/VCIDATA_9" LOC = CLB_R8C77.*:CLB_R26C77.*;
INST "vctop0/VCIDATA_10" LOC = CLB_R8C76.*:CLB_R26C76.*;
INST "vctop0/VCIDATA_11" LOC = CLB_R8C75.*:CLB_R26C75.*;
INST "vctop0/VCIDATA_12" LOC = CLB_R9C78.*:CLB_R27C78.*;
INST "vctop0/VCIDATA_13" LOC = CLB_R9C77.*:CLB_R27C77.*;
INST "vctop0/VCIDATA_14" LOC = CLB_R9C76.*:CLB_R27C76.*;
INST "vctop0/VCIDATA_15" LOC = CLB_R9C75.*:CLB_R27C75.*;

# Control / Status / Handshake
INST "vctop0/VCIHS0" LOC = CLB_R7C56.*:CLB_R25C56.*;
INST "vctop0/VCIHS1" LOC = CLB_R8C56.*:CLB_R26C56.*;
INST "vctop0/VCIHS2" LOC = CLB_R9C56.*:CLB_R27C56.*;
INST "vctop0/VCIREG_0" LOC = CLB_R10C56.*:CLB_R28C56.*;
INST "vctop0/VCIREG_1" LOC = CLB_R10C55.*:CLB_R28C55.*;
INST "vctop0/VCIREG_2" LOC = CLB_R11C56.*:CLB_R29C56.*;
INST "vctop0/VCIREG_3" LOC = CLB_R11C55.*:CLB_R29C55.*;


C
Installing and Compiling RTEMS
This is a short introduction to compiling and installing the RTEMS operating system and adding the network driver code. We first installed LECCS, the compiler suite, into /opt/rtems/. It is necessary to add this path (/opt/rtems/bin) to the environment. We then installed the RTEMS source into /rtems-4.5.0-jg-2. This directory is called the source directory in the following instructions. For the installation, we created a separate working directory, /rtems_build/. It is important that these two directories are in the same parent directory (in our case the home directory); otherwise the compilation will fail. We then configured the source. This is done in the working directory:

../rtems-4.5.0-jg-2/configure \
    --prefix=/opt/rtems \
    --target=sparc-rtems \
    --enable-gcc28 \
    --enable-posix \
    --enable-networking \
    --enable-cxx \
    --disable-multiprocessing \
    --enable-rtemsbsp=leon1 \
    --disable-tests

Then the whole system is built by invoking make. After this step, we have a complete version of the RTEMS operating system, but still without our network card driver. We now patched the source directory with the patch patch_source and the working directory with the patch patch_work:

/rtems-4.5.0-jg-2/ # patch -p1 < patch_source
/rtems_build/ # patch -p1 < patch_work


Then the whole system is compiled again. This step is quite short, as only the network card driver gets compiled and included in the libraries. Finally, the system needs to be installed with make install, which requires root privileges since it installs into /opt/rtems/. From now on, whenever the network driver is updated, the source must be compiled again with make followed by make install as root.


D
Miscellaneous
Intel/DVI ADPCM coder/decoder Copyright
Copyright 1992 by Stichting Mathematisch Centrum, Amsterdam, The Netherlands. All Rights Reserved.

Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the names of Stichting Mathematisch Centrum or CWI not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.

STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.



Bibliography
[1] Xilinx Inc. Virtex 2.5V FPGAs. Data Sheet. http://www.xilinx.com/partinfo/ds003-1.pdf

[2] Xilinx Inc. Virtex Series Configuration Architecture User Guide. Application Note XAPP151 (v1.5), September 27, 2000. http://www.xilinx.com/xapp/xapp151.pdf

[3] Gaisler Research. The LEON Processor User's Manual. Version 2.3.7, August 2001. http://www.gaisler.com

[4] Andreas Haase. Untersuchungen zur dynamischen Rekonfigurierbarkeit von FPGA. Diploma Thesis, September 2001. Technische Universität Chemnitz-Zwickau.

[5] Asahi Kasei Microsystems Co., Ltd. (AKM). AK4520A 100dB 20-Bit Stereo CODEC. Data Sheet. http://www.asahi-kasei.co.jp/akm/usa/product/ak4520a/ek4520a.pdf

[6] Xilinx Inc., LogiCore. Asynchronous FIFO. IP for COREGenerator. http://www.xilinx.com/ipcenter/catalog/logicore/docs/async_fifo.pdf

[7] Xilinx Inc., LogiCore. Adder/Subtractor. IP for COREGenerator. http://www.xilinx.com/ipcenter/catalog/logicore/docs/addsub.pdf

[8] OAR On-Line Applications Research Corporation. http://www.oarcorp.com

[9] OAR On-Line Applications Research Corporation. RTEMS Network Supplement. Edition 1, for RTEMS 4.5.0, September 6, 2000. http://www.oarcorp.com/rtems/releases/4.5.0/rtemsdoc-4.5.0/share/rtemsdoc/pdf/networking.pdf

[10] Intel Corp. Dual-Speed Fast Ethernet Transceiver. http://courses.ece.uiuc.edu/ece311/docs/datasheets/ethernet.pdf

[11] Jack Jansen, Centre for Mathematics and Computer Science. Simple 16-bit ADPCM coder and decoder. ftp://ftp.cwi.nl/pub/audio/adpcm.shar

[12] Andrew S. Tanenbaum. Computer Networks, Third Edition. Prentice Hall.

[13] Xilinx Inc. Xilinx Online Partial Reconfiguration FAQ. http://www.xilinx.com/xilinxonline/partreconfaq.htm

[14] Xilinx Inc. JBits Tutorial. JBits version 2.8. JBits@xilinx.com


[15] Stephan Schirrmann. Re: [leon_sparc] Xess board implementation. LEON mailing list. http://groups.yahoo.com/group/leon_sparc/message/1259

[16] Peter Sutton. VHDL XSV Board Interface Project. University of Queensland, Australia. http://www.itee.uq.edu.au/~peters/xsvboard/

[17] Chris Bagwell. SoX, Sound eXchange: Swiss Army Knife of Sound Processing Programs. http://sox.sourceforge.net
