IEEE TRANSACTIONS ON COMPUTERS, VOL. 48, NO. 6, JUNE 1999

Dynamic Reconfiguration to Support Concurrent Applications


Jack S.N. Jean, Member, IEEE, Karen Tomko, Member, IEEE Computer Society, Vikram Yavagal, Jignesh Shah, Student Member, IEEE, and Robert Cook

Abstract: This paper describes the development of a dynamically reconfigurable system that can support multiple applications running concurrently. A dynamically reconfigurable system allows hardware reconfiguration while part of the reconfigurable hardware is busy computing. An FPGA resource manager (RM) is developed to allocate and de-allocate FPGA resources and to preload FPGA configuration files. For each individual application, the different tasks that require FPGA resources are represented as a flow graph which is made available to the RM so as to enable efficient resource management and preloading. The performance of using the RM to support several applications is summarized. The impact of supporting concurrency and preloading in reducing application execution time is demonstrated.

Index Terms: Configurable computing, field programmable gate array (FPGA), reconfiguration, resource management, scheduling.

1 INTRODUCTION

Adaptive Computing Systems (ACS) have been shown to outperform general-purpose systems for some applications because of their ability to adapt hardware resources to application requirements [1], [8], [9], [13], [16]. The technology has been demonstrated for a few special-purpose applications which have been tediously hand-coded. These systems also have tremendous promise for accelerating more conventional applications, such as domain-specific visual development environments (Khoros, MATLAB, WiT) and web browsers (Netscape, Internet Explorer), which dynamically invoke submodules or plugins for image and data processing. Programming a device to support all of the possible submodules an application may invoke is not usually feasible due to the large number of submodules and the finite amount of hardware resources. However, an ACS may support reconfiguration of some hardware resources while other programmable hardware is busy computing. Such a system is referred to as a dynamically reconfigurable system. A dynamically reconfigurable system can configure the hardware on demand to support the requirements of interactive programs such as MATLAB and web browsers. One way to implement a dynamically reconfigurable ACS is to incorporate a large number of SRAM-based Field Programmable Gate Array (FPGA) chips on a coprocessing board used in conjunction with a traditional processor. In such a system, however, there is a need to provide an operating system-like interface for the programmable hardware: to hide the architectural details of the coprocessor, to manage reconfiguration of the hardware during application execution, and to fairly allocate FPGA resources among multiple processes.

This paper describes the system software development of a dynamically reconfigurable system that can support multiple applications running concurrently. A block diagram illustrating such a system is shown in Fig. 1, where each application consists of a program to be executed on the host machine and a flow graph representing the portion of the application to be executed on the FPGA resources. The host program is responsible for starting the execution of graph nodes through the resource manager (RM). With the information of multiple flow graphs, one per application, the RM allocates and de-allocates FPGA resources so that new nodes may be loaded into the system while other nodes are being executed. In addition, the RM adopts a speculative strategy in the preloading of FPGA configuration files to reduce and hide the reconfiguration overhead and to improve performance. The FPGA architecture is modular in the sense that the FPGA resources consist of a number of hardware units and each graph node uses an integer number of hardware units. Note that multiple copies of the same application can be executed at the same time. The system has the following technical advantages:

- Compared to static reconfiguration schemes, which do not reconfigure the hardware during the execution of an application, the system can accommodate more applications, typically those that require more FPGA resources than are available; their usage of FPGA resources can be satisfied once spread out over time. This is particularly true when the loading of some FPGA implementations depends on execution conditions. The system may also reduce the computation time of an individual application: since all of the required FPGA resources need not be loaded at once, a larger portion of the application computation can be mapped to FPGAs.

- Compared to other dynamic reconfiguration schemes that statically determine how to reuse the FPGA resources [1], [2], the system allocates FPGA resources at run time via an RM that relieves application developers from the management of FPGA resources. Due to the use of the RM and its speculative loading policy, multiple applications may share the FPGA resources effectively, much as in a virtual memory system.

The RAGE project [3] is similar to our own, but emphasizes partial reconfiguration. It does not support preloading of configurations.

Section 2 of this paper describes the development environment of the project. Section 3 presents the design and implementation of the RM. Several applications were used to test the RM; those applications, the testing procedure, and the results are summarized in Section 4. Section 5 compares the system to similar software in an operating system. Section 6 concludes the paper.

The authors are with the Department of Computer Science and Engineering, Wright State University, Dayton, OH 45435. E-mail: jjean@cs.wright.edu. For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number 109708.


Fig. 1. The dynamic reconfiguration system.

2 DEVELOPMENT ENVIRONMENT

2.1 Hardware Platform
The reconfigurable computing platform used in this project is a 180 MHz Pentium Pro personal computer hosting a G900 FPGA board, a PCI bus-based board manufactured by Giga Operations Corporation (GigaOps). The board has a modular design, making it suitable for resource sharing among applications. The design consists of eight computing modules (XMODs), where each XMOD contains two XC4020E FPGA chips, 2 MB of DRAM, and 256 KB of SRAM (see Fig. 2). Note that a maximum of 16 XMODs can be configured on one G900 board. The XMODs are connected by 128 wires, called the XBUS. Among those 128 wires, 21 are used to support a custom bus protocol, called HBUS, which defines the pins and timing of the signals used for the host (or, more specifically, the PPGA) to FPGA interface. The XBUS also contains six 16-bit buses that provide inter-XMOD connectivity. There are two special-purpose onboard FPGAs that are not part of any XMOD: the PPGA and the CPGA. The PPGA (Xilinx XC4013E-2) controls communication between the host computer and the XMODs (Fig. 2) by acting as the PCI bus interface to the board. The CPGA (Xilinx XC5210-5) implements clock generation, run-time configuration, and power-up functions. While the FPGAs can run at clock rates up to 66 MHz, the G900 board and host interface is currently limited to 16 MHz.

Fig. 2. G900 architecture.

2.2 Design Environment
The G900 board ships with a developer's kit which includes XLINK-OS, GOCOLIB, XLINKLIB, and XL [11], [17].

- XLINK-OS permits the host program to execute hardware designs using standard C function calls and to map variables that exist in the FPGAs into the host program's address space (memory-mapped variables). The FPGAs can be reconfigured with whatever configuration the host program requires.
- Two software libraries, GOCOLIB and XLINKLIB, are provided in the developer's kit. GOCOLIB provides low-level routines to interact with the module FPGAs, monitor CPUs, the PPGA, etc. XLINKLIB contains higher-level routines for interacting with the board. Both XLINKLIB and GOCOLIB must be linked into every XLINK-OS generated application.
- The XL language allows the specification of FPGA operations and is loosely based on C syntax, with many keywords the same as in C. It provides control of the features available in Xilinx FPGAs. There are the standard C operators plus a clock operator (:), which is used in program sequencing; all XL statements between two clock operators are executed during the same clock cycle.

Microsoft NT is used as the operating system. The design process begins with three source files which the user must create:

1. The source code for the FPGA design, written in either XL or VHDL.
2. A file describing the host-to-FPGA interface. It declares the memory-mapped variables and the functions the host calls to execute FPGA designs.
3. The application program that resides and executes on the host computer. It must call functions to initialize, load, and execute user FPGA designs in the XMODs.

The first two files are input to the XL compiler to produce a Xilinx netlist file. That netlist is used by the Xilinx tools to automatically map the design into an FPGA .bit file that contains the FPGA configuration. In addition, a C header file is generated which can be included in the host program to control the execution of FPGA designs on the G900 board. A more detailed description of the development environment is given in [12].
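
For a sense of what the host program sees, a generated header for a hypothetical design might resemble the C sketch below. The identifiers and offsets are invented for illustration and do not reproduce actual XL compiler output.

    /* Hypothetical header generated by the XL compiler for a design
     * named "madd".  All identifiers and offsets are invented for
     * illustration; they are not actual XLINK-OS output. */
    #ifndef MADD_H
    #define MADD_H

    /* Memory-mapped variables declared in the interface file, at
     * illustrative offsets within the XMOD address space. */
    #define MADD_OPA_OFFSET  0x0000   /* input operand A          */
    #define MADD_OPB_OFFSET  0x0004   /* input operand B          */
    #define MADD_SUM_OFFSET  0x0008   /* result read back by host */

    /* Functions the host calls to run the design on an XMOD. */
    int  madd_load(int xmod);       /* download madd.bit           */
    void madd_execute(int xmod);    /* start the FPGA computation  */
    int  madd_wait_done(int xmod);  /* block until interrupt       */

    #endif /* MADD_H */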


Fig. 3. Overview of resource manager.

3 DESIGN AND IMPLEMENTATION

To provide the dynamic reconfiguration capability and to support concurrent applications, an XMOD RM and a set of library functions have been designed and implemented. The system is diagrammed in Fig. 3. With the XMOD as the basic resource unit, the RM allocates and deallocates reconfigurable computing resources both on-demand and speculatively. A set of library functions is provided so that application developers can pass information from an application to the RM without worrying about the details of the interprocess communications or the details of G900 board control. In this section, the application scenario of the system is first described. A detailed design is then presented along with the implementation status. Some discussion of design issues follows.

3.1 Application Scenario
In the following paragraphs, we describe the scenario for applications executing with the RM. Both the application development scenario and the application execution scenario with the proposed system are given.

3.1.1 Application Development
An application is first analyzed or profiled to determine the computations that can be assigned to FPGAs or XMODs. Those computations are mapped to XMODs by creating the design.lnk file and going through the development process described in the previous section to generate a design.bit file that can be downloaded into the XMODs. The remaining parts of the application are assigned to the host program, which also provides data and controls the execution of the computations on the XMODs.
The computations mapped to XMODs are represented as a flow graph which is passed to the RM when the application starts executing. An application flow graph is a weighted graph where each node represents XMOD computation and the weighted edges represent the control flow of the host program. The computational granularity of graph nodes may differ greatly, and each node requires either a fixed number of XMODs or a range of numbers of XMODs. For example, a node can be either a simple integer addition that requires one XMOD or a complicated two-dimensional discrete cosine transform that can use from one to eight XMODs, depending on the desired performance. An example edge weighting is shown in Fig. 4. After the execution of graph node A, the next candidate node can be node B, C, or D, depending on a condition evaluated in the host program. Three weighted edges go out of node A, and the weight of each edge represents the estimated probability of the destination node being executed given that node A is being executed. The edge weights are used by the RM to preload FPGA configuration files: higher weights lead to a higher chance of preloading, and zero weight indicates no need for preloading. The edge weights are assumed to be constants during the application execution in this paper, even though removing this assumption may potentially lead to better performance. The fundamental assumption of the flow graph model is that the computational granularity of graph nodes may differ greatly and a node may require only a portion of the available FPGA resources. It is therefore not efficient to execute graph nodes on an FPGA system one at a time. Instead, multiple nodes, not necessarily from the same application, should be executed concurrently, and new nodes may be loaded into the system while other nodes are being executed.
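
For concreteness, the following C sketch shows one plausible in-memory representation of such a flow graph. The type and field names are ours, invented for illustration, and are not taken from the RM implementation.

    /* Illustrative host-side representation of an application flow
     * graph (invented names, not the RM's actual data structures). */
    #define MAX_NODES 16
    #define MAX_EDGES  8

    struct fg_edge {
        int    dest;     /* index of the successor node              */
        double weight;   /* estimated probability that dest executes
                          * next, given that this node is executing  */
    };

    struct fg_node {
        char   bitfile[260]; /* pathname of the FPGA configuration   */
        int    min_xmods;    /* smallest usable number of XMODs      */
        int    max_xmods;    /* largest useful number of XMODs       */
        int    num_edges;
        struct fg_edge out[MAX_EDGES];
    };

    struct flow_graph {
        int    num_nodes;
        struct fg_node node[MAX_NODES]; /* execution starts at node 0 */
    };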

Fig. 4. A flow chart example.

3.1.2 Application Execution
During the execution of an application, the RM runs as a background process on the host machine. Each application provides a flow graph and the corresponding FPGA configuration files to the RM. The RM loads or preloads FPGA implementations during the host program execution. The preloading implements a speculative strategy that overlaps XMOD reconfiguration with computation on other XMODs, such that the reconfiguration latency is reduced or completely hidden. Because the edges in a flow graph are used only for the preloading of FPGA configuration files, an edge missing from a flow graph does not influence the correctness of the computation. It does, however, influence the execution performance. It is assumed that applications are developed in a way that executing one graph node at a time is sufficient, though not necessary, to guarantee the completion of individual applications. With such applications, the system is able to prevent deadlock.

A set of library functions has been developed to simplify application development. The library functions support the passing of a flow graph, the demand loading request, the node release request, the board release request, and some XMOD I/O capabilities. When a library function is called from within an application, information is passed to or retrieved from the RM through interprocess socket communication. Initially, the application provides the flow graph, along with the complete pathnames of the FPGA configuration files used for each node, to the RM. The RM speculatively loads these configuration files, if free XMODs are available, to reduce and hide the overhead associated with reconfiguration of FPGAs at run time. When the application needs to do a computation mapped to FPGAs, it requests that the RM load the required bit file into an XMOD. It then waits till the RM responds with the number of the XMOD that has been assigned to the application. If the bit file has been speculatively preloaded, the application does not have to wait for loading of the configuration file and gets the number of the assigned XMOD immediately. However, if the node has not been preloaded or there are no free XMODs available, then the application waits until an XMOD becomes available and is loaded as requested. After an XMOD is allocated and loaded, the application packs the input data for the computation into an array and sends it to the G900 board. Once the input data has been written to the XMOD, the application initiates the computation. On completion of the computation, the function mapped to the FPGA should be designed to interrupt the RM, which in turn informs the application. Results are then retrieved by the application. If the computation is complete for the node represented in the flow graph, the XMOD is released; otherwise, the input, execute, and result steps are repeated. When the application is done with all the computations that have been mapped to FPGAs, it informs the RM, which will no longer speculate on any nodes from the application's flow graph and will release any XMODs preloaded for the application.
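
The calling pattern above can be condensed into a host-side skeleton. All rm_* and helper names are invented placeholders; the paper does not list the actual library function names.

    /* Skeleton of a host program driving one flow graph through the
     * RM library.  All rm_* and helper names are placeholders. */
    #include <stddef.h>

    struct flow_graph;                   /* see the Section 3.1 sketch */
    extern struct flow_graph *build_flow_graph(void);
    extern void rm_send_flow_graph(const struct flow_graph *g);
    extern int  rm_load_node(int node);  /* returns the assigned XMOD  */
    extern void rm_write_inputs(int xmod, const int *buf, size_t n);
    extern void rm_execute(int xmod);    /* blocks until the interrupt */
    extern void rm_read_results(int xmod, int *buf, size_t n);
    extern void rm_release_node(int node);
    extern void rm_release_flow_graph(void);
    extern int  next_node(int node, int *done); /* host-side condition */

    int main(void)
    {
        int inbuf[64], outbuf[64];
        int node = 0, done = 0;
        struct flow_graph *g = build_flow_graph();

        rm_send_flow_graph(g);       /* RM may start preloading node 0 */
        while (!done) {
            int xmod = rm_load_node(node);  /* immediate if preloaded  */
            /* ... pack input data into inbuf ... */
            rm_write_inputs(xmod, inbuf, 64);
            rm_execute(xmod);               /* FPGA interrupts the RM  */
            rm_read_results(xmod, outbuf, 64);
            rm_release_node(node);
            node = next_node(node, &done);  /* condition picks successor */
        }
        rm_release_flow_graph();     /* stop speculation for this app  */
        return 0;
    }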

3.2 Resource Manager Design
The RM is implemented as a multithreaded application. An overview of the design is shown in Fig. 3. The main thread of the RM is the first thread to be created and is the parent of the other threads. It first initializes the G900 board, then spawns the loader, interrupt handler, and scheduler threads. It also sets up a server socket for incoming connection requests from applications and waits for requests. A new application service thread is created for each requesting application, which then interacts with the application on behalf of the RM. The main thread loops back to listen for new requests. Communication among the different threads of the RM is accomplished through events, mutexes, shared variables, and shared memory. The application service thread establishes a stream socket connection with the client application and services its requests. It receives the application flow graph, puts it into the shared memory, and notifies the scheduler. Depending on the type of request sent from the application, the application service thread responds in different ways. There are six types of requests that can be sent from the application, as listed below.
- Load Graph Node: Request the allocation of XMODs for a flow graph node and load FPGA configuration files onto one or more XMODs. If the XMODs have been assigned and preloaded with the configuration files for that node, then the XMOD numbers are returned to the application immediately. If, however, no XMODs have been assigned, then the application service thread places a demand request for the XMODs with the scheduler. When the XMODs have been assigned and loaded with the required files, it returns the XMOD numbers to the application.
- Input Data: On receiving the input data array, the application service thread writes the value of each memory-mapped input variable at its specified offset within the XMOD.
- Result Data: The application service thread retrieves the result data from the memory-mapped variables on the XMODs and returns this data to the host program.
- Execute Function: The application service thread starts execution of a specific function on the XMODs and waits till the interrupt handler indicates the occurrence of an interrupt on one or more of the assigned XMODs. It acknowledges the interrupt(s) and then informs the client application. The interrupt(s) may indicate completion of the computation or some intermediate stage. The service thread waits for the next request from the application, which might be reinitiation of computation or collection of result data.
- Release XMOD: The service thread deallocates all of the XMODs associated with a specific flow graph node.
- Release Flow Graph: The service thread discards the application's flow graph, informs the scheduler that the application flow graph is no longer valid, and then terminates.

When an application executes an FPGA function, it normally blocks until the function is completed. The completion of an FPGA function sends an interrupt from an XMOD to the interrupt handler thread of the RM. The thread checks which XMODs have generated an interrupt, since more than one XMOD could be interrupting at a time. It then informs the corresponding application service thread about the interrupt. Once all the interrupts have been acknowledged by their respective application service threads, the interrupt handler enables further interrupts and loops back to wait until another interrupt occurs. For each graph node, an application developer needs either to implement an interrupt request circuit in the FPGAs or to let the host program wait a prespecified amount of time for the function to complete. The latter approach works only if the function completion time is known in advance or can be determined in a well-formulated way.

The scheduler thread, which allocates XMODs either on demand or speculatively, normally sits idle until triggered by one of three types of events from an application service thread: 1) a request for demand loading, 2) the deallocation of XMODs due to the release of a graph node, and 3) the receipt of a new flow graph. Depending on the type of event, its scheduling parameters, and the availability of resources, the scheduler either assigns an XMOD to the loader thread for loading or loops back to wait for another event to occur.
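
For illustration, the six request types listed above could be framed on the application-to-RM socket as a small tagged message. The encoding below is an assumption of ours, not the RM's actual wire format.

    /* Illustrative framing of application-to-RM requests on the
     * stream socket (assumed encoding, not the actual wire format). */
    enum rm_request_type {
        REQ_LOAD_GRAPH_NODE,    /* allocate XMOD(s), load bit file(s)   */
        REQ_INPUT_DATA,         /* write memory-mapped input variables  */
        REQ_RESULT_DATA,        /* read memory-mapped result variables  */
        REQ_EXECUTE_FUNCTION,   /* start FPGA function, await interrupt */
        REQ_RELEASE_XMOD,       /* free all XMODs of one graph node     */
        REQ_RELEASE_FLOW_GRAPH  /* discard graph, end service thread    */
    };

    struct rm_request {
        enum rm_request_type type;
        int node;           /* flow graph node the request refers to */
        int payload_bytes;  /* length of the data that follows       */
        /* payload (input array, etc.) follows on the socket */
    };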

The scheduling policy accepts three parameters that can be specified as arguments to the RM when it is invoked. These parameters determine how aggressively the scheduler speculatively preloads graph nodes. They are defined as follows:

- MAX_SPECULATE: the maximum number of immediate successor nodes of the currently executing node in a flow graph that can be speculatively loaded;
- THRESHOLD: the minimum edge weight probability for speculative preloading of a successor node;
- FREE_XMODS: the minimum number of XMODs that should not be preloaded and should be kept aside for demand loading requests.

The scheduling policy has three sections based on the events that can trigger the scheduler. Each section is explained below.

1. Demand Loading:
- If the node requested for demand loading has been preloaded or is being preloaded, the number of the assigned XMOD is returned to the requesting application service thread.
- If the requested node has not been and is not currently being preloaded, a free XMOD is searched for and assigned to it for loading. If no free XMOD is available, an XMOD assigned to the same application service thread for some other node is searched for and assigned. If no XMOD has been assigned to the application service thread, an XMOD that has been preloaded or is being speculatively loaded is preempted and assigned.
- If all XMODs are executing, the demand request is queued in a demand queue and the requesting application service thread is suspended. It is woken up when its demand request is serviced and an XMOD is assigned.
- Once an XMOD has been assigned to the requesting application service thread, its bit file is scheduled for loading and the XMOD number is given to the application service thread, which waits till the loading is completed before passing the XMOD number to the client application.
- Irrespective of the type of event triggering the scheduler, any demand loading requests pending in the demand queue are given the highest priority. New demand requests are queued at the end of the demand queue. Preloading for new or existing applications is done only if free XMODs remain after all the demand requests have been serviced.

2. Arrival of a New Application Flow Graph:
- While the number of free XMODs is higher than FREE_XMODS, when a new application flow graph arrives, node 0 of the flow graph is preloaded on a free XMOD. All new flow graphs are serviced before speculating on existing application flow graphs.
- The threshold weight probability is not considered when preloading node 0 of a new graph, under the assumption that the application will always start execution of the flow graph from node 0.

3. Releasing of an XMOD:
- While the number of free XMODs is higher than FREE_XMODS, one immediate successor of the currently executing node in each flow graph is speculated, until the MAX_SPECULATE limit is reached for that node; once the limit is reached, the flow graph is skipped.
- For a node to be speculatively loaded, its edge weight probability, calculated as its edge weight divided by the combined weights of all outgoing edges of the current node, must be higher than THRESHOLD.
- If the node to be speculated has been executed before on an XMOD, as happens with loops, the RM checks whether the configuration file is still loaded on that XMOD. If so, the node is simply marked as preloaded; otherwise, the node is loaded on a free XMOD if its edge weight probability is higher than THRESHOLD.
- The speculation of flow graphs is done in a circular fashion and continues until the scheduler comes back to the flow graph it started with in the present scheduling cycle. Scheduling begins from the flow graph following the last flow graph scheduled in the previous cycle.

In order to efficiently allocate XMOD resources under the speculative loading environment, the RM maintains the state of each XMOD, as shown in Fig. 5. If an XMOD is preloaded but not yet in use, it may be deallocated when another request cannot otherwise be satisfied. If an XMOD is loaded on demand, it cannot be deallocated until it is released.

Fig. 5. XMOD state diagram.

Since the loading of a configuration bit file is slow and must be done serially on the G900 board, the actual loading of bit files is done by the loader thread. This allows the scheduler to respond faster to demand requests and other scheduling events. The scheduler queues bit files to be loaded in two queues maintained in the shared memory: a demand queue and a speculation queue. The loader thread serially loads the bit files queued by the scheduler onto their assigned XMODs. Bit files in the demand queue are given priority over bit files in the speculation queue. On completion of loading, the application service thread that is waiting for an XMOD is signaled.

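The release-of-an-XMOD case can be restated as a short routine. The C sketch below is a simplified, single-threaded rendering of the policy described above; the helper functions are assumptions of ours, and locking and the demand queue are omitted.

    /* Sketch of the speculative preloading pass run when an XMOD is
     * released.  Helpers are assumed; locking and queues omitted. */
    extern int    MAX_SPECULATE, FREE_XMODS; /* RM invocation arguments */
    extern double THRESHOLD;

    struct app {
        struct flow_graph *graph;   /* see the Section 3.1 sketch   */
        int current;                /* index of the executing node  */
        int speculated;             /* successors speculated so far */
    };

    extern int  free_xmods(void);                     /* assumed helpers */
    extern int  already_speculated(struct app *a, int node);
    extern int  still_loaded(struct app *a, int node);
    extern void mark_preloaded(struct app *a, int node);
    extern void queue_speculative_load(struct app *a, int node);

    /* Edge weight probability: this edge's weight as a fraction of
     * the combined weights of all edges leaving the current node. */
    double edge_probability(const struct fg_node *n, int e)
    {
        double total = 0.0;
        for (int i = 0; i < n->num_edges; i++)
            total += n->out[i].weight;
        return total > 0.0 ? n->out[e].weight / total : 0.0;
    }

    void speculate_pass(struct app *apps, int napps, int start)
    {
        /* Visit flow graphs round-robin, starting after the last
         * graph scheduled in the previous cycle. */
        for (int k = 0; k < napps && free_xmods() > FREE_XMODS; k++) {
            struct app *a = &apps[(start + k) % napps];
            const struct fg_node *cur = &a->graph->node[a->current];
            if (a->speculated >= MAX_SPECULATE)
                continue;                   /* limit reached: skip graph */
            for (int e = 0; e < cur->num_edges; e++) {
                int succ = cur->out[e].dest;
                if (already_speculated(a, succ) ||
                    edge_probability(cur, e) <= THRESHOLD)
                    continue;
                if (still_loaded(a, succ))   /* loop case: bit file is */
                    mark_preloaded(a, succ); /* still on its old XMOD  */
                else
                    queue_speculative_load(a, succ);
                a->speculated++;
                break;                  /* one node per graph per pass */
            }
        }
    }
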
3.3 More Design Issues
The current design and implementation of the RM supports multiple concurrent applications with preloading.
Several design issues complicate the RM, and some have not yet been addressed. These issues are described as follows:

1. Direct XMOD Data Access. The standard mechanism for an application to load data to XMODs, or to unload data from XMODs, is a library function that requires copying the data between the application and the RM. For applications with frequent data access or large quantities of data, a more efficient implementation is available that allows individual applications to access those XMODs directly without going through the RM.

2. Inter-graph-node Constraints. Some resources on the G900 board other than the XMODs may be shared by different flow graph nodes. For example, the XBUS can be used for inter-XMOD communication. If a graph node uses multiple XMODs and some wires from the XBUS, such resource requirements should be specified and provided to the RM. Currently, none of the test applications in this paper uses the XBUS for inter-XMOD communication, and the current RM does not examine such constraints.

3. Optimal Resource Allocation. For an application flow graph node, there is a trade-off between the resulting performance and the number of XMODs used. For most graph nodes, using more XMODs is not expected to yield linear speedup. Therefore, when a range of XMOD counts is specified for a node, the corresponding performance figure for each count can be specified so that the RM may use the information to optimally allocate resources at run time.

4 PERFORMANCE RESULTS

Two main applications were used to test the system operation: an MPEG-2 encoder program and an application based on the NP-complete satisfiability problem, in which we synthesized a flow graph with four nodes, each node exhaustively solving the satisfiability problem for a different logic formula. The two applications are briefly described below.

1. MPEG-2 Encoder. MPEG-2 is a standard for digital video and audio compression. The MPEG-2 encoder that is available in source code form from the MPEG Software Simulation Group was profiled with the Visual C++ Profiler [4]. Two time-consuming functions are full_search( ) and dist1( ), which handle the motion estimation of the MPEG-2 video encoding algorithm. The part of those two functions that handles forward matching and backward matching has been mapped to XMODs and implemented. The resulting flow graph for the application has only one graph node. That node can use one to eight XMODs, and all the XMODs use exactly the same FPGA design. A more detailed description of the design can be found in [5]. The design was first tested without the RM (i.e., with static reconfiguration); the results show that, even though more XMODs do improve performance, the last few XMODs do not provide as much benefit as the first few. Although not supported yet, such performance figures could in the future be provided to the resource manager to improve resource utilization and overall performance.

2. Satisfiability. The satisfiability problem is the problem of deciding whether a formula in conjunctive normal form is satisfied by some truth assignment [15]. For example, the following four-variable formula is in conjunctive normal form, with three clauses ANDed together, and can be satisfied when x1 = true, x2 = false, and x3 = false:

    (x1 ∨ x2 ∨ x4) ∧ (¬x1 ∨ ¬x2) ∧ (¬x3).

Historically, the satisfiability problem was the first decision problem shown to be NP-complete. The problem is convenient for testing the RM because different formulae can be tested using the same FPGA design, simply by initializing the design with different values; this allows control over the amount of FPGA computation time. A simple FPGA design that exhaustively solves the problem is shown in Fig. 6. Note that this FPGA design was not intended as an accelerator, even though it was faster than the Pentium host. FPGA designs that are meant to accelerate the satisfiability problem can be found in [18] and [14].

Fig. 6. Satisfiability FPGA design.

The FPGA design in Fig. 6 implements a deterministic solution to the satisfiability problem by checking every truth assignment. It works as follows: A formula in conjunctive normal form that contains at most (n+1) clauses is represented as two matrices of binary values where each clause is represented as two binary vectors, A1[ ] and A2[ ]. Each A1[ ] bit indicates if a variable is in a clause and each A2[ ] bit indicates if a variable is negated or not. Those two matrices are initialized by the host program. Each truth assignment is represented as a binary vector, h, stored in an up counter which starts from zero. For each truth assignment, the formula is evaluated by going through the clauses one by one. The host is interrupted when either the formula is satisfied or all the truth assignments have been exhausted. When the formula is satisfied, the host can read the truth assignment, i.e., the h value, that satisfies the formula. This h value is important in the verification of the
system operation. The FPGA design fits in one FPGA chip and, therefore, one single XMOD. Based on the FPGA design, an application was artificially synthesized. The application, called the multiple-satisfiability, contains four graph nodes in its flow graph, where each graph node is for the satisfiability evaluation of a formula. Four formulae were pseudorandomly produced and used in the application. Because those formulae are fixed and specific conditions are used in the host program to determine the control flow, it is known which nodes get evaluated and in what order if the computation is correct. The setup was purposely made to test the speculative loading performance. Note that we pretend that all four graph nodes use different FPGA configuration files to better represent real applications even though in reality the same file is used. Because of this assumption, the execution of a new graph node requires the reloading of the configuration file.
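
Under our reading of the A1/A2 encoding (one bit per variable per clause), a software model of the evaluation loop in Fig. 6, using the example formula from above, looks as follows. This is an illustrative C rendering, not the actual hardware description.

    /* C model of the exhaustive satisfiability check of Fig. 6.
     * Bit v-1 of A1[c] is 1 if variable xv appears in clause c;
     * the same bit of A2[c] is 1 if that occurrence is negated.
     * (Our reading of the encoding, not the design's HDL.) */
    #include <stdio.h>

    #define NVARS    4
    #define NCLAUSES 3

    int main(void)
    {
        /* (x1 v x2 v x4) & (~x1 v ~x2) & (~x3) */
        unsigned A1[NCLAUSES] = { 0x0B, 0x03, 0x04 };
        unsigned A2[NCLAUSES] = { 0x00, 0x03, 0x04 };

        for (unsigned h = 0; h < (1u << NVARS); h++) { /* up counter */
            int sat = 1;
            for (int c = 0; c < NCLAUSES && sat; c++)
                /* a clause holds when some selected variable bit of
                 * h differs from its negation bit in A2 */
                sat = ((h ^ A2[c]) & A1[c]) != 0;
            if (sat) {
                printf("satisfied by h = 0x%x\n", h); /* "interrupt" */
                return 0;
            }
        }
        printf("all truth assignments exhausted\n");
        return 0;
    }

Running this model prints h = 0x1, i.e., x1 = true and the other variables false, matching the satisfying assignment given above.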

4.1 Simulation Results
The satisfiability problem, with different numbers of nodes and different node granularities, and the MPEG encoder were used to examine system performance. All readings were taken as the average of three independent runs. First, system operations such as G900 board initialization and bit file loading were timed to quantify the overheads of our development system. Board initialization takes about 2.34 seconds, while loading one configuration file takes about 0.35 seconds. Each test application was run first without the RM and then with the RM to measure the overhead of using the RM. When an application does not use the RM, it must initialize the G900 board and map it into its address space; as a result, applications cannot run concurrently without the RM. When the RM is used, the board is initialized only once, when the RM starts.

Fig. 7. High granularity satisfiability execution times.

Thus, board initialization time was not counted when the test applications were run using the RM. Note that when an application is not using the RM, it does not have a flow graph but still generates interrupts to indicate the end of FPGA computation.

Fig. 8. Low granularity satisfiability execution times.

4.1.1 TEST 1: Satisfiability Problem with High Granularity Nodes
An application with four nodes was developed, each node a satisfiability problem with an FPGA computation time in the range of 1.9 sec to 4 sec. These node granularities are much higher than the FPGA configuration time of 0.35 sec. One, four, eight, and 12 copies of the application were run sequentially without the RM, and both sequentially and concurrently with the RM using no speculative preloading. The results are summarized in Fig. 7. A single application took 14.1 sec to complete without the RM and 13.5 sec when run through the RM. Ideally, since the board initialization time was not counted in the second case, it should have been 2.34 sec less; due to the overheads of packing and unpacking data and communicating with the RM, it was only 0.6 sec less. Thus, the overhead of using the RM was 1.74 sec for this application. When the application was run four times sequentially without the RM, it took 56.06 sec to complete, versus 54.27 sec with the RM. But when four copies of the application were run concurrently using the RM, they took just 14.87 sec to complete, speeding up the execution by a factor of 3.8 over sequential execution without the RM. Since the loading of bit files must be done sequentially even when the applications are running concurrently, approximately 5.6 sec (4 x 4 x 0.35) of the 14.87 sec total execution time was spent loading the FPGAs. However, an FPGA on an XMOD in the G900 board can be loaded while FPGAs on other XMODs are executing. The PCI bus is thus shared between FPGA loading and data loading/unloading for other FPGAs, which lengthens the FPGA configuration time depending on the dynamic condition of the board. The overlap of FPGA loading with execution, together with concurrent execution on different FPGAs, produces the reduced total time. When eight copies of the application were run on the G900 board concurrently through the RM, they took 18.17 sec to complete, as compared to 111.26 sec for sequential
execution without the RM and 108.87 sec with the RM. Since one FPGA on each of the eight XMODs of the G900 board is used in parallel, a speedup factor of 6.1 is achieved over sequential execution without the RM, which uses only one FPGA on one XMOD at a time. With 12 copies of the application running concurrently using the RM, a speedup factor of 8.2 is obtained. Since the number of applications is greater than the number of FPGA resources, some applications must wait in the demand queue until FPGAs become free before obtaining FPGA resources for computation.

4.1.2 TEST 2: Satisfiability Problem with Low Granularity Nodes
The satisfiability problem used in TEST 1 was initialized with different formulae to get four nodes with computation times between 8 msec and 26 msec, much lower than the 350 msec required to load an FPGA on the G900 board with its bit file. This means that much of the application execution time is spent loading the FPGAs rather than performing computation on them. The timings for one, four, eight, and 12 copies of the application were obtained as in TEST 1. The results are shown in Fig. 8. The speedup obtained by running four copies of the application concurrently through the RM is just 2.1; with eight copies it is 2.31, and with 12 copies it is 2.6. The speedup factor is low because there is less parallelism in the application execution on the FPGAs, most of the time being spent sequentially loading the FPGA configurations. Thus, the overhead of communicating with the G900 board through the PCI bus, along with some hardware constraints on the G900 that result in high FPGA configuration time, dictates the minimum node granularity for hiding latency and obtaining worthwhile speedup for concurrently running applications.

4.1.3 TEST 3: MPEG Encoders
A test similar to TESTs 1 and 2 was done with two MPEG encoders. Each MPEG encoder processes 27 frames of images and uses a single XMOD. The motion estimation part of the MPEG encoding is performed on the XMOD. It requires more frequent and larger data transfers than the satisfiability problem.

Fig. 9. MPEG encoder timings.

For each image frame that is used as a reference frame, the whole frame is sent at once from the host to the XMOD. For the other frames, an image block of 16 x 16 pixels is sent only after the previous block has been processed by the XMOD. The timings obtained by running the encoders sequentially and concurrently are shown in Fig. 9. When a single encoder was run without the RM, it took 45.35 sec to process the 27 frames, whereas through the RM it took 44.85 sec, due to the saving of the board initialization time. When two encoders were run in parallel using the RM, they ran 1.4 times faster than two encoders executing serially without the RM, which required 91.29 sec to complete the processing. The speedup is not very high because of the large amount of data transferred between the FPGA board and the host, but the concurrent use of MPEG encoders is very useful for image processing applications.
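
The transfer pattern just described can be sketched as follows; the function names are invented placeholders, and only the 16 x 16 block size and the block-after-block pacing come from the text.

    /* Sketch of the host-to-XMOD transfer pattern for motion
     * estimation (invented placeholder names). */
    extern void rm_write_frame(int xmod, const unsigned char *f, int n);
    extern void send_block_16x16(int xmod, const unsigned char *f,
                                 int width, int x, int y);
    extern void rm_execute(int xmod);       /* blocks until interrupt */
    extern void collect_motion_result(int xmod, int x, int y);

    void process_frame(int xmod, const unsigned char *frame,
                       int width, int height, int is_reference)
    {
        if (is_reference) {
            /* a reference frame is sent to the XMOD in one transfer */
            rm_write_frame(xmod, frame, width * height);
            return;
        }
        /* other frames: send each 16x16 block only after the
         * previous block has been processed by the XMOD */
        for (int y = 0; y + 16 <= height; y += 16)
            for (int x = 0; x + 16 <= width; x += 16) {
                send_block_16x16(xmod, frame, width, x, y);
                rm_execute(xmod);
                collect_motion_result(xmod, x, y);
            }
    }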

Fig. 11. Execution times for four satisfiability problems w.r.t. FREE_XMODS.

4.1.4 TEST 4: High Granularity Satisfiability with Speculative Preloading
To verify the effect of speculative preloading of FPGAs based on the flow graphs given to the RM by the applications, four copies of the satisfiability problem used in TEST 1 running concurrently were reevaluated. The effect of varying the scheduling parameters on the total execution time of the applications was also measured. Since there are eight XMODs on the G900 board and preloading is done on free XMODs, the number of applications used for testing was limited to four. The flow graph used for the satisfiability problem in this test is shown in Fig. 10. The edge weights are arranged such that node 1 will be preloaded after node 0, then node 2, and then node 3; this is the actual execution sequence of the application. The application execution times were obtained as a function of the scheduling parameters MAX_SPECULATE and FREE_XMODS with THRESHOLD held constant, and as a function of THRESHOLD and MAX_SPECULATE with FREE_XMODS kept constant at zero.

4.1.5 Results with THRESHOLD Constant
The execution times obtained as a function of FREE_XMODS and MAX_SPECULATE are shown in Fig. 11. THRESHOLD was held constant at 0.3 for these results, which meant that all the nodes in the flow graph shown in Fig. 10 could get preloaded. It can be concluded from Fig. 11 that the speculative preloading of FPGA configurations does help hide the configuration overhead, resulting in lower overall execution time. The lowest execution time is obtained when the number of speculated nodes is highest. However, as the number of free XMODs reserved for demand loading increases, the overall execution time increases due to the reduced number of speculated nodes.

4.1.6 Results with FREE_XMODS Constant
For the results in Fig. 12, FREE_XMODS was held constant at zero so that no XMOD was kept aside for demand loading requests. A THRESHOLD value of 0.5 meant that only nodes 1 and 2 could be preloaded. With THRESHOLD equal to 0.6, only node 1 could be preloaded, with node 2 and node 3 being demand loaded. The execution times in Fig. 12 again confirm that the overall execution time increases as the number of speculated nodes is reduced. Fig. 13 shows the timing distribution for a single run of four copies of the high granularity satisfiability problem running concurrently. In the figure, the loading of bit files and the execution of the nodes of the four applications overlap as expected, which results in the reduced execution time. In summary, for the four high granularity satisfiability applications, aggressive preloading was advantageous. The best performance was obtained when THRESHOLD was 0.3, MAX_SPECULATE was 2, and FREE_XMODS was 0; it was 17 percent faster than without any speculative loading of nodes.

Fig. 10. Flow graph for satisfiability with high granularity nodes.

Fig. 12. Execution times for four satisfiability problems w.r.t. THRESHOLD.
4.1.7 TEST 5: Satisfiability Problem with Nonoptimized Flow Graph To study the effect of a flow graph being provided by the application that did not accurately represent the execution flow, the flow graph shown in Fig. 10 was modified to the flow graph shown in Fig. 14. According to the edge weights, node 2 gets preloaded after node 0 and then node 1 gets preloaded, depending upon THRESHOLD and MAX_SPECULATE values. From node 1, node 3 gets preloaded before node 2 and, even though execution is complete after node 3 is executed, the flow graph indicates that node 1 or node 0 may be executed after node 3. This incorrect representation of the application execution in its flow graph, which could also be due to some condition evaluation differing from the usual case, results in higher number of preemption of nodes and thus increases the overall execution time. The timing values obtained as a function of MAX_SPECULATE are shown in Fig. 15. The THRESHOLD value was kept constant at 0.3, which meant

The way that FPGA resources are managed in this paper is in some regards similar to a virtual memory system that uses prepaging, with two major differences.

- First, while prepaging can only be used for processes that were previously swapped out (and is not applicable to new processes), the preloading of FPGA configurations can be applied to new processes because of the availability of the application flow graphs. Given the relatively coarse granularity of graph nodes, providing an application flow graph is arguably quite feasible for an application developer, especially since the FPGA design process is typically lengthy and tedious.

- Second, while a paged virtual memory system is truly modular in that all pages are treated the same, the FPGA resources may not really be modular. For example, the XBUS on the G900 can be used to simultaneously support several groups of inter-XMOD communication as long as the usage of XBUS pins is disjoint among the groups.

Fig. 13. Timing diagram for four high granularity satisfiability problems (MAX_SPECULATE = 1, THRESHOLD = 0.3, and FREE_XMODS = 0).

Fig. 14. Incorrect flow graph for satisfiability.

The result is that the execution of one application graph node on XMODs may prevent the execution of another graph node. The requirement for such shared resources by each application graph node should be indicated in the flow graph and sent to the resource manager; note that this is not supported by the current resource manager. As another example, the FPGA chips on a board may have different I/O capabilities. Since the FPGA resources are mainly used for computing, instead of data storage, the RM is, in a sense, similar to the processor scheduler in a multiprocessing operating system. However, because of the assumption of data dependency and conditionals in an application flow graph, deterministic scheduling techniques, as summarized in [6], cannot be applied to the RM. Instead, techniques such as the one in [7], which is based on the availability of a run-time profile at compile time, are more applicable. Such techniques could be used to help produce application flow graphs, which were always manually generated in this paper.

Fig. 15. Execution times for four satisfiability problems with the nonoptimized flow graph w.r.t. MAX_SPECULATE.

6 CONCLUSIONS

A dynamic reconfiguration system that can support concurrent applications has been designed, implemented, and tested on a PC with a G900 FPGA board. Compared to static reconfiguration schemes, the proposed system can accommodate more applications and potentially reduce computation times for individual applications. Compared to other dynamic reconfiguration schemes, the proposed system allocates FPGA resources at run time via a resource manager (RM) that relieves application developers from the management of FPGA resources. The RM can preload FPGA configurations by utilizing its knowledge of application flow graphs. Simulation results show that, even though there is overhead associated with using the resource manager, the concurrency supported by the system can drastically speed up application execution.

As programs such as MATLAB use libraries of functions to improve programmer productivity, one advantage of the proposed dynamic reconfiguration system is that it can support a library of FPGA functions, say, one for a 2D convolution, one for a histogram equalization, and so on. With the system, there is no need to squeeze all of the FPGA functions used by a program into the hardware resources at the same time. Such an environment would allow programmers to enjoy the performance benefits of adaptive computing technology without worrying about FPGA design details and would accelerate the adoption of the technology.

Future research is necessary to port the RM to other FPGA boards that may not be as modular as the G900 board; handling the asymmetry in hardware resource units is a very challenging problem. Another issue in dynamic reconfiguration is to design a similar RM for systems that support partial reconfiguration. The virtual hardware manager developed in the RAGE project [3] could probably be integrated with the resource manager in this paper so that not only are concurrent applications supported but FPGA function density is also improved through partial reconfiguration. In [10], several models of DPGA program execution are presented. One of them is on-demand usage, which is similar to the proposed system. That paper did not pursue the model, but it claimed, "Although it may seem a rather futuristic scenario, there are good reasons for believing that in the fields of multimedia, communications, databases, and cryptography at least, the characteristics of the applications themselves are likely to demand this sort of highly flexible execution environment."

ACKNOWLEDGMENTS
This research is supported by DARPA under U.S. Air Force contract number F33615-97-1-1148, an Ohio State investment fund, and an Ohio State research challenge grant. Xilinx Inc. donated an FPGA design tool and FPGA chips on XMODs.

REFERENCES
[1] J.G. Eldrege and B.L. Hutchings, "Run-Time Reconfiguration: A Method for Enhancing the Functional Density of SRAM-Based FPGAs," J. VLSI Signal Processing, vol. 12, pp. 67-86, 1996.
[2] J.D. Hadley and B.L. Hutchings, "Designing a Partially Reconfigured System," FPGAs for Fast Board Development and Reconfigurable Computing, Proc. SPIE 2607, pp. 210-220, 1995.
[3] J. Burns, A. Donlin, J. Hogg, S. Singh, and M. Wit, "A Dynamic Reconfiguration Run-Time System," Proc. FCCM, pp. 66-75, 1997.
[4] MPEG Software Simulation Group, http://www.mpeg.org/MSSG/.
[5] R. Cook, J. Jean, and J.S. Chen, "Accelerating MPEG-2 Encoder Utilizing Reconfigurable Computing," Proc. CERC/VIUF and IEEE Computer Society Workshop, Univ. of Dayton, Dec. 1997.
[6] M.J. Gonzalez, "Deterministic Processor Scheduling," Computing Surveys, vol. 9, no. 3, pp. 173-204, Sept. 1977.
[7] S. Ha and E.A. Lee, "Compile-Time Scheduling of Dynamic Constructs in Dataflow Program Graphs," IEEE Trans. Computers, vol. 46, no. 7, pp. 768-778, July 1997.
[8] K. Hill, L. Foti, D. Zebe, and B. Box, "Real-Time Signal Preprocessor Trade-Off Study," Proc. IEEE Nat'l Aerospace and Electronics Conf., pp. 328-335, May 1995.
[9] W.E. King, T.H. Drayer, R.W. Conners, and P. Araman, "Using MORRPH in an Industrial Machine Vision System," Proc. IEEE Symp. FPGA Custom Computing Machines, pp. 18-26, 1996.
[10] I. Page, "Reconfigurable Processors," invited keynote address, Heathrow PLD Conf., Apr. 1995.
[11] B. Taylor and D. MacLeod, X Language Doc., Giga Operations Corp., 1994.
[12] V. Yavagal, "A Resource Manager for Configurable Computing Systems," MS thesis, Wright State Univ., July 1998.
[13] J. Villasenor, B. Schoner, K. Chia, C. Zapata, H. Kim, C. Jones, S. Lansing, and B. Mangione-Smith, "Configurable Computing Solutions for Automatic Target Recognition," Proc. IEEE Symp. FPGA Custom Computing Machines, pp. 70-79, 1996.
[14] A. Rashid, J. Leonard, and W.H. Mangione-Smith, "Dynamic Circuit Generation for Solving Specific Problem Instances of Boolean Satisfiability," Proc. IEEE Symp. Field-Programmable Custom Computing Machines, Apr. 1998.
[15] T.A. Sudkamp, Languages and Machines, pp. 351-353. Addison-Wesley, 1988.
[16] M.J. Wirthlin and B.L. Hutchings, "Sequencing Run-Time Reconfigured Hardware with Software," Proc. ACM/SIGDA Int'l Symp. FPGAs, pp. 122-128, 1996.
[17] XLINK-OS Programmer's Reference Manual, Giga Operations Corp., 1996.
[18] P. Zhong, M. Martonosi, P. Ashar, and S. Malik, "Accelerating Boolean Satisfiability with Configurable Hardware," Proc. IEEE Symp. Field-Programmable Custom Computing Machines, Apr. 1998.

Karen Tomko earned her doctorate degree at the University of Michigan in 1995. She received the MSE and BSE degrees from the University of Michigan in 1992 and 1986, respectively. She has been an assistant professor at Wright State University, Dayton, Ohio, since January of 1996. From 1986-1992, she developed image processing and system software for Synthetic Vision Systems and the Environmental Research Institute of Michigan. Her research interests are adaptive computing systems, parallel computing, application optimization, and graph partitioning. She is a member of the IEEE Computer Society and the ACM.

Vikram Yavagal received the BS degree in electronics engineering from the University of Pune, India, in 1995 and the MS degree in computer engineering from Wright State University, Dayton, Ohio, in 1998. His main interests are in reconfigurable computing systems, operating systems, and distributed computing systems.

Jignesh Shah received his BE degree in electronics from the University of Bombay in 1996. He received his MS degree in computer science from Wright State University, Dayton, Ohio, in December 1998. His major interests are in networking, reconfigurable environments, and programming language paradigms. He is also working as a consultant for enterprise-wide client-server applications. He is a student member of the IEEE.

Jack Shiann-Ning Jean received the BS and MS degrees from the National Taiwan University, Taiwan, in 1981 and 1983, respectively, and the PhD degree from the University of Southern California, Los Angeles, in 1988, all in electrical engineering. In 1989, he received a research initiation award from the U.S. National Science Foundation. Currently, he is an associate professor in the Computer Science and Engineering Department of Wright State University, Dayton, Ohio. His research interests include parallel processing, reconfigurable computing, and machine learning. He is a member of the IEEE.

Robert R. Cook received a BS in systems analysis in 1976 and an MBA from Miami University, Oxford, Ohio, in 1986. He received an MS in computer science from Wright State University, Dayton, Ohio, in 1997. Cook has held various positions in the computer industry. He has extensive knowledge of various hardware and software platforms from PCs through mainframes. He is currently pursuing his PhD in computer science and engineering at Wright State University, where his interests are operating systems and reconfigurable computing.