
IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 10, NO. 3, JUNE 2006

A Genetic Algorithm for the Design Space Exploration of Datapaths During High-Level Synthesis
Vyas Krishnan and Srinivas Katkoori, Senior Member, IEEE
Abstract: High-level synthesis comprises interdependent tasks such as scheduling, allocation, and module selection. For today's very large-scale integration (VLSI) designs, the cost of solving the combined scheduling, allocation, and module selection problem by exhaustive search is prohibitive. However, to meet design objectives, an extensive design space exploration is often critical to obtaining superior designs. We present a framework for efficient design space exploration during high-level synthesis of datapaths for data-dominated applications. The framework uses a genetic algorithm (GA) to concurrently perform scheduling and allocation with the aim of finding schedules and module combinations that lead to superior designs while considering user-specified latency and area constraints. The GA uses a multichromosome representation to encode datapath schedules and module allocations, and efficient heuristics to minimize functional and storage area costs while minimizing circuit latencies. The framework provides the flexibility to perform resource-constrained scheduling, time-constrained scheduling, or a combination of the two, using a simple and fast list-scheduling technique. A graded penalty function is used as the objective function in evaluating the quality of designs, enabling the GA to quickly reach areas of the search space where designs meeting user-specified criteria are most likely to be found. Since GAs are population-based search heuristics, a unique feature of our framework is its ability to offer a large number of alternative datapath designs, all of which meet the design specifications but differ in module, register, and interconnect configurations. Extensive experiments on well-known benchmarks show the effectiveness of our approach.

Index Terms: Datapath synthesis, design space exploration, genetic algorithms (GAs), high-level synthesis.
Fig. 1. Silicon compilation through high-level synthesis.

I. INTRODUCTION

THE IMPRESSIVE proliferation of highly complex digital signal processing very large-scale integration (VLSI) circuits in many of today's electronic products is mainly fueled by breakthroughs in efficient design techniques developed over the past two decades. There is a growing consensus among VLSI designers that one of the most effective methods to handle the complexity of today's system-on-chip (SoC) designs is to use computer-aided design (CAD) techniques that start with an abstract behavioral or algorithmic description of a circuit and automatically synthesize a structural description of a digital circuit that realizes the behavior. The behavioral description consists of algorithmic statements containing computational operations (additions, multiplications, comparisons, logical operations) and control operations (conditional statements, loops, and procedure calls). The structural description maps the operations and data transfers onto functional units in a datapath and a control unit that coordinates the flow of data between the various functional units of the datapath. The datapath comprises hardware units (ALUs, multipliers, logic gates), storage units (registers, register files, RAM, ROM), and interconnect units (multiplexers, buses) that are connected together to realize the specified behavior. This structural description is called a register transfer (RT)-level description. Once an RT-level design of a circuit is obtained, it can be transformed into a logic gate-level netlist through logic synthesis, then into a layout via layout synthesis, and finally fabricated into an integrated circuit. Fig. 1 illustrates a typical high-level synthesis flow used for creating a chip design, starting from an abstract algorithmic specification. This process of transforming a behavioral or algorithm-level description into a synthesizable structural description affords a

Manuscript received March 18, 2004; revised January 25, 2005. This work was supported in part by the National Science Foundation under the Faculty Early Career Development Program under Grant CCR-0093301. The authors are with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620 USA (e-mail: krishnan@eng.usf.edu; katkoori@csse.usf.edu). Digital Object Identifier 10.1109/TEVC.2005.860764

1089-778X/$20.00 © 2006 IEEE


Fig. 2. Example of behavioral description and its dataflow graph.

methodology of automatically synthesizing a realizable digital circuit from an abstract algorithmic specification of the design, thus considerably reducing the design cycle time. VLSI designs are multiobjective by nature, since they have to trade off several conflicting design objectives such as chip area, circuit delays, and power dissipation. The shorter design time offered by behavioral synthesis allows one to examine many alternative circuit realizations during the design process.

A. High-Level Synthesis

The term high-level synthesis describes the process of transforming an algorithmic or behavioral description into a structural specification of the architecture realizing the behavior. The algorithmic description specifies the input and output behavior of the algorithm in terms of operations and data transfers, independent of any implementation-level details. Most high-level synthesis systems represent the behavioral description in the form of a directed acyclic graph, called a dataflow graph (DFG). A dataflow graph identifies the operations (tasks) in the behavioral description and all the dependencies between them. The vertices of the dataflow graph represent these operations, with directed edges identifying the dependencies between them. Fig. 2 gives an example of a dataflow graph for the behavioral description of an example from [6]. The goal of a high-level synthesis system is to convert the abstract specification contained in a dataflow graph into a structural specification that maps the operations and data transfers represented by the vertices and edges, respectively, of the dataflow graph onto hardware functional units, storage

units, and their interconnections. Often, the structural specification is divided into a datapath comprising the functional and storage units and a control unit that coordinates the flow of data between the datapath elements. Due to this division, high-level synthesis is traditionally divided into datapath synthesis and controller synthesis. In this paper, we primarily focus on datapath synthesis. Datapath synthesis consists of three interdependent subtasks: scheduling, allocation, and binding [1], [2]. Fig. 3 depicts the sequence of subtasks performed in a traditional high-level synthesis process. The scheduling subtask maps a set of operations in the algorithmic description onto a set of discrete time steps, such that all data dependencies/precedence constraints specified in the algorithmic description are met. Scheduling fixes the order in which operations in the input specification will be executed. The objective of the scheduling step is to map all the operations to time steps such that the total number of time steps required to implement the specified behavior, called the schedule length, meets the timing constraints for the synthesized hardware and minimizes implementation area. The scheduling subtask is an important step during high-level synthesis, since it determines the speed and throughput of the synthesized circuit. There are two variations of scheduling commonly used during high-level synthesis: time-constrained (TC) scheduling and resource-constrained (RC) scheduling [1]. In TC scheduling, an upper bound on the schedule length is specified, and the goal of the scheduler is to minimize the number of functional units needed to realize the behavior. In RC scheduling, an upper bound on the number of functional units available is specified, and the goal of the scheduler is to minimize the schedule


length with the available functional units. Scheduling under resource constraints is important in VLSI since resource usage determines the circuit area of resource-dominated circuits and often represents a significant portion of the overall area of a circuit. Allocation determines the amount of hardware resources (in terms of functional, storage, and interconnect units) that can be shared by operations and data transfers specified in the behavioral description. Since total chip area is an important constraint in any VLSI design, resource allocation plays a vital role in determining the total area occupied by a synthesized circuit. While allocation determines the amount of resources needed in a synthesized design, binding determines the actual mapping of operations to specific functional units and of data transfers to storage units in the synthesized structural design. Hence, the binding subtask, to a large extent, determines the interconnect resources required to implement a design. As an example, the dataflow graph in Fig. 2 could be scheduled in four time steps as shown in Table I. Note that for this schedule to be feasible, the allocation subtask must allocate a minimum of four multipliers, one ALU, and one comparator. The table also shows a possible binding of the dataflow graph operations to the allocated functional resources.

TABLE I: EXAMPLE OF SCHEDULING, ALLOCATION, AND BINDING FOR DATAFLOW GRAPH IN FIG. 2

Fig. 3. Interdependence of subtasks in high-level synthesis.

B. Motivation for This Work

The domain of VLSI design is multiobjective in nature, often with the need to trade off several conflicting design objectives such as chip area, circuit speed, power dissipation, and manufacturing cost. For many VLSI designs, speed and cost requirements vary over different market segments of the target application. High-end segments, such as server CPUs and hardware router ASICs, demand high performance, while low-end but high-volume markets, such as ASICs for cell phones, demand low cost and power consumption (but not necessarily high performance). Hence, depending on the application, there is a strong motivation to explore the design space of alternative circuit implementations before finalizing one that meets all design tradeoffs. Design space exploration has a higher payoff when done during high-level synthesis than at lower levels of abstraction such as the logic or transistor levels. Design space exploration in high-level synthesis involves evaluating different operator schedules, module allocations, and module binding alternatives for a given design specification. Datapath synthesis can be modeled as the process of searching a complex multidimensional space represented by the set of possible schedules, allocations, and bindings that can realize a given behavioral specification. Under latency and resource constraints, the subtasks of scheduling, allocation, and binding are known to be NP-hard [3]. There is a strong interdependence between these synthesis subtasks, and it is not clear in which order they should be performed [1]. For example, the allocated number of hardware units constrains the maximum number of concurrent operations that can be scheduled, but the minimum number of hardware units that need to be allocated to meet design timing constraints cannot be known until the scheduling subtask is completed. In addition, decisions taken in any of the subtasks significantly impact decisions taken in subsequent subtasks, often determining the quality of the solutions found. To simplify the synthesis task, many high-level synthesis systems start with some assumptions (such as an upper bound on the number of functional units available) and perform the synthesis subtasks sequentially. Many systems typically start with scheduling, followed by allocation, in that order. However, performing these subtasks independently may miss the optimal design altogether, and the best strategy is to perform these synthesis subtasks concurrently [2]. As modern VLSI and SoC designs become more complex, a major problem is the extremely large number of possible schedule and allocation combinations that must be examined


in order to select a design that meets constraints and is optimal. This process, called design space exploration, is further compounded by the need for shortening design times due to time-to-market pressures. Since an exhaustive search could be prohibitive and an ad hoc design exploration could be inefficient, designers often select a conservative architecture after some experimentation, which often results in a suboptimal design. Given this scenario, there is an acute need for techniques that automate the efficient exploration of the large design space in a reasonable time during high-level synthesis of datapaths. In this paper, we present a genetic algorithm (GA)-based technique for performing the combined subtasks of scheduling and allocation in high-level behavioral synthesis. Our approach is motivated by the fact that the search space for high-level synthesis is large and discrete, and GAs are known to work well on such problems [28], [36]. In addition, the inherent parallelism of GAs allows efficient design space exploration during a single optimization iteration. Our approach uses a GA to concurrently search the space of datapath schedules and module/storage allocations. It combines an indirect chromosome representation with efficient heuristics to derive a solution from the chromosome encoding. A multichromosome representation is used to encode a topologically ordered list of scheduled tasks and a resource list containing the number of functional units allocated. An efficient list-scheduling heuristic is used to derive a schedule from the task list, and a left-edge heuristic is used to determine the storage requirements for the datapath from the derived schedule. The paper is organized as follows. Section II discusses related work on various subtasks of the high-level synthesis problem.
In Section III, we elaborate our GA-based approach to solving this problem; we discuss the encoding strategies, the various crossover and mutation operators, the initialization and selection strategies, the heuristics used to interpret the encodings, and the fitness function used to evaluate solutions. In Section IV, we describe the experimental setup, the benchmark problems, and the experimental results. In Section V, we provide an analysis and discussion of our GA-based high-level synthesis technique, and finally, in Section VI, we summarize the paper and draw conclusions based on our experimental results.

II. RELATED WORK

A. Previous Work on High-Level Synthesis

A large number of scheduling and allocation techniques have been developed for high-level synthesis. Most of the early high-level synthesis systems performed the synthesis subtasks of scheduling and allocation separately. For example, these systems typically performed scheduling first, followed by functional unit and memory resource allocation. However, there is a strong interdependence between these synthesis subtasks, and there is no clear consensus on the order in which they should be performed [1]. Decisions taken in one of the subtasks constrain decisions taken in a subsequent subtask and often have a large impact on the quality of the solutions found. Recognizing this interdependence, several techniques have been proposed that simultaneously perform scheduling and allocation for design space exploration in high-level synthesis. Several researchers

have applied GAs to the problem of simultaneously considering the subtasks of scheduling and resource allocation during high-level synthesis [17], [24]-[26], [30], [39]. Scheduling techniques used in high-level synthesis can be broadly classified into four types: constructive approaches; iterative transformational approaches; probabilistic approaches, such as simulated annealing; and exact approaches, such as integer linear programming. The constructive approach schedules one operation or control step at a time until all operations are scheduled. Important algorithms taking this approach include as-soon-as-possible (ASAP) scheduling [2], as-late-as-possible (ALAP) scheduling [2], list scheduling [2], critical path scheduling [7], force-directed scheduling [6], and path-based scheduling [8]. After the operations are scheduled, allocation of functional unit and storage resources is usually performed. Many of these constructive techniques use greedy heuristics such as list scheduling [3]-[5] and force-directed scheduling [6]. Constructive approaches have the advantage of being fast and simple to implement. However, being greedy, they are vulnerable to local minima in the design space, leading to poor solution quality despite the use of sophisticated heuristics. Transformational synthesis algorithms, on the other hand, start with default schedules and then repeatedly apply semantics-preserving transformations to improve the initial schedules. The Yorktown Silicon Compiler system [9] starts with a maximally parallel schedule, which is then serialized to satisfy the resource constraints. The CAMAD design system [10], on the other hand, starts with a maximally serial schedule and attempts to parallelize it to reduce the schedule length. These systems attempt to escape local minima in the design space by incorporating suitable hill-climbing heuristics.
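As a concrete illustration of the constructive style mentioned above, ASAP scheduling assigns each operation to the earliest control step its data dependencies allow. The following is a minimal sketch; the dataflow graph and operation names are hypothetical, not taken from the paper, and single-cycle operations are assumed.

```python
# ASAP scheduling sketch: each operation starts at the earliest control
# step allowed by its predecessors. The DFG below is a made-up example.

def asap_schedule(dfg):
    """dfg maps each operation to the list of operations it depends on.
    Returns {op: control_step}, with steps starting at 1."""
    steps = {}

    def step_of(op):
        if op not in steps:
            preds = dfg[op]
            # An op with no predecessors starts at step 1; otherwise one
            # step after its latest-finishing predecessor.
            steps[op] = 1 if not preds else 1 + max(step_of(p) for p in preds)
        return steps[op]

    for op in dfg:
        step_of(op)
    return steps

# Hypothetical DFG: two multiplies feed an add, which feeds a compare.
dfg = {"m1": [], "m2": [], "add": ["m1", "m2"], "cmp": ["add"]}
print(asap_schedule(dfg))  # {'m1': 1, 'm2': 1, 'add': 2, 'cmp': 3}
```

ALAP scheduling is the mirror image: it assigns each operation to the latest step that still meets the overall latency bound, working backward from the outputs.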
However, the quality of solutions found by transformational techniques depends, to a large extent, on the variety and effectiveness of the transformations used, as well as on the heuristics used to select between applicable transformations. Mathematical approaches formulate the synthesis problem as an integer linear programming (ILP) problem [11], [12]. These approaches give optimal solutions, but their computational complexity is exponential, resulting in very large run times for realistic design problems. Probabilistic techniques such as simulated annealing (SA) and simulated evolution (SE) have also been applied to the high-level synthesis problem. The simulated annealing technique presented in [13] simultaneously solves the subtasks of scheduling, functional-unit allocation, and register minimization. SALSA [14], [15] is another SA-based approach that uses an extended binding model and several probabilistic search operators to improve the performance of an SA-based approach to high-level synthesis. The approach by Ly and Mowchenko [16] applies simulated evolution to the simultaneous tasks of scheduling and resource allocation in high-level synthesis. Though these probabilistic methods produce good-quality solutions, their run times are often large and increase rapidly with the size of the problem.

B. GA-Based Approaches to High-Level Synthesis

The search space of the combined scheduling/allocation problem in high-level synthesis is highly multimodal, with


many widely separated local minima. Searching the complex solution space of this problem requires a tradeoff between two apparently conflicting objectives: exploiting the current best solutions in the search space and efficiently exploring the search space. GAs offer several advantages over other techniques in the area of design space exploration for high-level synthesis. GAs are effective in intelligently handling the tradeoff between the exploration and exploitation of solutions in the search space, which is an essential ingredient for efficient design space exploration in high-level synthesis. In addition, since GAs are population-based search techniques, one advantage of using GA-based synthesis is the ability to produce multiple designs that satisfy user-specified design constraints. These multiple designs may vary in circuit speed or chip area and have the potential to offer greater flexibility in the subsequent stages of the design process. Several high-level synthesis systems have used GAs to perform some or all of the synthesis subtasks. More recently, GAs have also been applied to design space exploration at the system-on-a-chip level [43] and at the microcode level of instruction set processors [44]. In all these GA-based approaches, a population of solutions to the synthesis problem is iteratively improved through the application of genetic operators to selected individuals in the population. These systems mainly differ in their chromosome representations and the genetic operators used to search the solution space. In ADaPaS [25], a GA is applied to the scheduling and allocation subtasks of datapath synthesis. In that work, the authors use a GA to search for a tradeoff between resource allocation and schedule length by reweighting a cost function. The method is based on assigning a displacement to each of the operations to be scheduled and constructing a topologically ordered schedule using a modified ASAP algorithm.
Resource constraints are used to defer operations when all resources in the current cycle considered for scheduling are occupied. In their chromosome representation, each operation is represented by its absolute displacement from its ASAP time frame. They use simple crossover and mutation operations and do not need any specialized genetic sequencing operators. The main disadvantage of their chromosome representation is that displacement of operations on the critical path of a dataflow graph has a significant impact on the quality of the schedules generated by the GA. In [25], special initialization methods are presented to seed the initial population with schedules within the time constraint by distributing displacements over critical paths. Though an improvement in the quality of solutions is reported with this seeding technique, the simple crossover operators used in their GA do not preserve the displacements of critical-path operations during the run of the GA, producing suboptimal schedules. In PSGA [24], a problem-space GA is applied to the design space exploration of datapaths. The authors' approach is based on the use of a problem-specific chromosome representation where each operation is assigned a priority (called work remaining) based on the length of the longest path from the node to the output. They use the concept of a heuristic/problem pair to map each chromosome to a valid schedule. A problem-specific heuristic, called the work-remaining heuristic, is used to decode each chromosome into a schedule and a module allocation. Their chromosome representation enables them to use

simple crossover and mutation operators to generate valid solutions. However, with their chromosome representation, the number of unique chromosomes that map to the same solution is large, making the search space large even for problems of modest size. In [26], Heijlingers et al. propose a GA-based technique for time-constrained scheduling. The authors use a permutation of operations to represent a chromosome and use a list scheduler to decode each of the chromosomes into a valid schedule. Since many operations in an input dataflow graph typically have precedence constraints, and hence can be scheduled only after all their predecessor operations are completed, the use of random permutations to represent schedules results in an exponentially large number of permutations mapping to the same schedule. This increases the size of the search space proportionately, thus slowing the GA search. In GABIND [27], a GA is applied to the allocation and binding phases of high-level synthesis. The approach uses an unconventional crossover mechanism relying on a force-directed datapath binding completion algorithm. A bus-based interconnection scheme and the use of multiport memories are two of the key features of the system. The system, however, does not handle scheduling of operations and assumes a scheduled dataflow graph as its input. Finally, in [17] and [36], the authors describe a GA-based high-level synthesis system that uses a binary encoding for the chromosomes, as opposed to problem-specific representations such as those in [24], [25], and [27]. A binary encoding is used to store information on the control step assigned to each operation in the dataflow graph and the functional module assigned to the operation. An inherent disadvantage of using a binary encoding is that the size of the chromosome increases exponentially with the size of the problem, with a corresponding increase in the size of the search space, leading to large run times for realistic problem sizes.
In this paper, we propose a GA-based technique that addresses some of the deficiencies of earlier GA-based approaches to high-level synthesis. Our GA is designed to perform the combined subtasks of scheduling and allocation in high-level behavioral synthesis and to concurrently search the space of datapath schedules and module/storage allocations. Our approach is motivated by the fact that the search space for high-level synthesis is large and discrete, and GAs are known to work well on such problems [28]. In addition, the inherent parallelism of GAs allows efficient design space exploration during a single optimization iteration. Our GA incorporates three features that enable it to efficiently perform design space exploration during high-level synthesis. First, a multichromosome encoding is used to concurrently handle a search in the space of schedules and allocations for an input behavioral specification. A novel encoding is used to specify datapath schedules. This encoding, designed to improve the efficiency of the GA search for design space exploration, uses a chromosome representation that encodes the precedence relationships among the tasks in the input behavioral specification with a topological order-based representation to specify schedule priorities. This alleviates the many-to-one relationship that exists between the genotype and phenotype encodings in other GA-based high-level synthesis systems. This chromosome


Fig. 4. Proposed framework for high-level synthesis of datapaths.

representation enables the GA to prune the search space, increasing search efficiency. Second, our GA uses an efficient list-scheduling heuristic to decode chromosomes into valid schedules. The list-scheduling heuristic is not limited to using a single fixed priority function as in other list-scheduling approaches, but instead uses adaptive task priorities that are dynamically decided by the GA. This enhances the robustness of our GA, enabling the scheduler to avoid local optima. Third, our GA concurrently performs register minimization in the datapath, in addition to functional allocation. A variant of the left-edge algorithm is used to determine the register storage requirements for each schedule. Since the cost of registers, in terms of chip area, can be significant, our GA-based framework aims to minimize the number of registers needed in a synthesized datapath.
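The left-edge idea referred to above can be sketched as follows: values whose lifetimes do not overlap may share a register, so intervals are sorted by start time and greedily packed. This is a minimal sketch of the classic algorithm, not the paper's exact variant, and the lifetime intervals are hypothetical.

```python
# Left-edge style register allocation (illustrative sketch): each value has
# a (start, end) lifetime in control steps; non-overlapping lifetimes can
# share a register.

def left_edge_registers(lifetimes):
    """Greedily pack lifetime intervals, sorted by start time, into
    registers. Returns a list of registers, each a list of intervals."""
    registers = []
    for start, end in sorted(lifetimes):
        for reg in registers:
            # The last value placed in this register dies before (or as)
            # the new value is born, so they can share the register.
            if reg[-1][1] <= start:
                reg.append((start, end))
                break
        else:
            # No compatible register found: allocate a new one.
            registers.append([(start, end)])
    return registers

# Hypothetical lifetimes for five intermediate values.
regs = left_edge_registers([(1, 3), (2, 4), (3, 5), (1, 2), (4, 6)])
print(len(regs))  # prints 2: the five values fit in two registers
```

The register count this produces is what the framework feeds into its storage-area cost estimate.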

Fig. 5. Structure of GA.

III. DESIGN OF GA

Our high-level synthesis system employs the robust search capabilities of a GA to concurrently solve the datapath synthesis subtasks of scheduling and allocation of resources (functional units such as ALUs and storage units such as registers), with the aim of finding schedules and module/storage combinations that lead to superior designs, while considering user-specified latency and area constraints. The GA uses a multichromosome representation to encode datapath schedules and functional unit allocations and efficient heuristics to minimize register and interconnect costs. The proposed synthesis framework provides the flexibility to perform resource-constrained scheduling, time-constrained scheduling, or a combination of the two. The various design choices used in our GA are described in this section, with the assumption that the reader is familiar with the basic ideas and general structure of GAs.

A. Overall GA Structure

Fig. 4 gives an overview of the proposed framework used for high-level synthesis. The input to the GA includes a description of the dataflow graph (DFG) representation of the behavioral description of the datapath, a set of design constraints, user control parameters, and specifications (area/speed) of the functional and storage modules of a design library. The user constraints can include resource and/or latency constraints. User control parameters can include cost-function-related parameters, such as performance-area tradeoff weights, and GA-related parameters, such as population size, maximum number of evaluations, and probabilities of application of the genetic operators (crossover and mutation). User control parameters can optionally be preset once and stored in a file that the GA reads each time it is run, or they can be tuned by the user for each problem instance. The overall structure of the GA used in our high-level synthesis framework is shown in Fig. 5. The GA used is a steady-state GA based on the classification in [23]. The iterative loop of the algorithm starts with an initial population of individuals created randomly to ensure a uniform sampling of points in the search space as a starting point. In our version of the steady-state GA, in each generation two parent individuals are selected from the population with a probability proportional to their fitness,


Fig. 6. Chromosome encoding used in GA.

and two new individuals (i.e., offspring) are created from these through the application of genetic operations and are then considered for insertion into the population. Binary tournament selection [22] was seen to perform well during our preliminary experiments and was used as the selection operator for the two parent individuals. To ensure sufficient diversity in the population at all times, duplicate individuals (i.e., individuals representing identical solutions) are not allowed in the population. Our experiments indicated that duplicate avoidance results in significant performance improvements over configurations that allow duplicates. Hence, before every offspring is added to the population, a duplicate check is performed. In a duplicate check operation, the genome representing the offspring is compared with that of every member of the GA population. If the genome of the offspring is identical to that of an existing population member, the offspring is discarded; otherwise, it is added to the population. An offspring that survives the duplicate check is introduced into the population only if it is better than the current worst member of the population, in which case the offspring replaces the worst member. This makes our GA elitist in nature and ensures that the search by the GA progresses monotonically toward optimal regions of the search space. The duplicate avoidance policy adopted in our GA ensures that every population member is unique in the genotype space. Our experiments have shown that performing a duplicate check on a newly created offspring before it is introduced into the population enables the GA to maintain population diversity throughout the GA run and to reach highly fit areas of the solution space efficiently. Each population member comprises two strings: one of them is a topologically sorted permutation of tasks, and the other is a list of the numbers of hardware functional units allocated to implement the schedule.
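The steady-state loop just described (binary tournament selection, duplicate rejection, elitist replace-worst insertion) can be sketched as follows. This is a minimal sketch of the control flow only: the fitness function and the crossover and mutation operators below are toy stand-ins, not the paper's operators or its chromosome encoding.

```python
import random

# Steady-state GA loop sketch: binary tournament selection, duplicate
# check against the whole population, and replace-worst (elitist) insertion.

def tournament(pop, fitness):
    """Binary tournament: sample two members, return the fitter one."""
    a, b = random.sample(pop, 2)
    return a if fitness(a) > fitness(b) else b

def steady_state_ga(init_pop, fitness, crossover, mutate, generations=100):
    pop = list(init_pop)
    for _ in range(generations):
        p1, p2 = tournament(pop, fitness), tournament(pop, fitness)
        for child in crossover(p1, p2):
            child = mutate(child)
            if child in pop:                    # duplicate check: genotypes stay unique
                continue
            worst = min(pop, key=fitness)
            if fitness(child) > fitness(worst):  # elitist replace-worst insertion
                pop[pop.index(worst)] = child
    return max(pop, key=fitness)

# Toy usage: maximize the number of 1s in an 8-bit string.
rng = random.Random(1)
init = [tuple(rng.randint(0, 1) for _ in range(8)) for _ in range(10)]
best = steady_state_ga(
    init,
    fitness=sum,
    crossover=lambda a, b: [a[:4] + b[4:], b[:4] + a[4:]],  # one-point stand-in
    mutate=lambda c: c,                                     # no-op stand-in
    generations=200,
)
print(sum(best))
```

Because the worst member is only ever replaced by a strictly better offspring, the best fitness in the population never decreases, which is the monotone-progress property claimed above.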
A topologically sorted permutation is used to ensure that all precedence constraints among tasks in the input dataflow graph are maintained in each chromosome representation. Our GA applies genetic operators to both strings independently, enabling the GA to concurrently search

the spaces of schedules and resource allocations. Problem-specific heuristics are used as decoders to process the two strings representing an individual and to determine the schedule length and the amounts of functional and storage units needed to implement the schedule. The initial population consists of random permutations of tasks that are topologically correct in preserving the precedences in the input behavioral specification, together with random numbers of allocated functional resources.

B. Problem Encoding

A suitable problem encoding must be chosen to ensure that both scheduling and allocation information can be represented and that suitable genetic operators can be designed to enable the GA to reach optimal or near-optimal solutions. Our GA uses a multichromosome representation, comprising two independent substrings, to encode each individual in the GA population. One of the substrings, called the node-priority field, encodes a topologically sorted, permuted list of the tasks to be scheduled. The node-priority field is an indirect representation of the final schedule, which is obtained by applying a list-scheduling heuristic to the node-priority field. Using this combination of a permuted task list and a list scheduler, the heuristic ensures that every chromosome in the population results in a feasible schedule. The other substring, called the resource-allocation field, consists of a list of integers specifying the maximum number of functional units of each type available for scheduling in every time step of the schedule. This encoding scheme always gives an active schedule and ensures that for each active schedule, there exists an appropriate corresponding permuted sequence of tasks. Fig. 6(b) illustrates an example of the encoding used for the IIR filter dataflow graph shown in Fig. 6(a).
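The two-field encoding and its decoding into a schedule can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the DFG, the operation-to-unit-type table, and the unit budgets are made up, and every operation is assumed to take a single control step, with results available in the step after they are produced.

```python
# Sketch of the two-field chromosome: a node-priority field (topologically
# ordered task permutation) and a resource-allocation field (unit budget
# per functional-unit type), decoded by a simple list scheduler.

def list_schedule(priority, dfg, op_type, resources):
    """Decode a chromosome into {op: control_step}. `dfg` maps each op to
    its predecessor list. Assumes >= 1 unit of every needed type."""
    schedule = {}
    step = 1
    while len(schedule) < len(priority):
        free = dict(resources)                  # unit budget for this step
        for op in priority:                     # scan in priority order
            # Ready: every predecessor finished in an earlier step.
            done = all(schedule.get(p, step) < step for p in dfg[op])
            if op not in schedule and done and free.get(op_type[op], 0) > 0:
                schedule[op] = step
                free[op_type[op]] -= 1
        step += 1
    return schedule

dfg = {"m1": [], "m2": [], "add": ["m1", "m2"], "cmp": ["add"]}
op_type = {"m1": "mult", "m2": "mult", "add": "alu", "cmp": "alu"}
chromosome = (["m1", "m2", "add", "cmp"],   # node-priority field
              {"mult": 1, "alu": 1})        # resource-allocation field

# With only one multiplier allocated, m1 and m2 must serialize.
print(list_schedule(chromosome[0], dfg, op_type, chromosome[1]))
# {'m1': 1, 'm2': 2, 'add': 3, 'cmp': 4}
```

Raising the multiplier budget in the resource-allocation field to 2 lets m1 and m2 run in parallel and shortens the schedule by one step, which is exactly the schedule-length/area tradeoff the GA explores.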
A schedule builder is used to decode the node priority field to derive a valid schedule, subject to the resource constraints specified by the resource allocation field of the chromosome representation. The schedule builder used in our system is a list-scheduling


Fig. 7. One-point topological crossover.

heuristic which binds ready operations to available functional units in the order they appear in the node priority field of the chromosome. An operation in the node priority field is said to be ready in a given time step if all its predecessor nodes have completed execution before that time step and a functional unit which supports the operation is available at that time step. Since every permutation of operations can be decoded into a valid schedule, this representation ensures that the GA can reach every part of the solution space, including the optimal solution. The resource allocation field ensures that every possible combination of functional resources can be represented. The individuals in the initial population contain random permutations of operations that are topologically correct in preserving the precedences in the input behavioral specification, together with random combinations of allocated functional units chosen between the limits {1, ..., upper bound}, where the upper bound on the number of functional units can be either user specified or computed from the ASAP schedule [2] of the input algorithmic specification. Since the GA searches for priority lists that optimize the cost function, by traversing a search space induced by permutations of operations, it has the ability to find optimal or near-optimal schedules. It can be similarly argued that by traversing a search space induced by integers representing all possible combinations of functional unit allocations, the GA has the ability to find optimal or near-optimal allocations of functional units.

C. Crossover Operators

The two substrings of the chromosome (node-priority field and resource-allocation field) have their own crossover operators. In our GA, the crossover operators for the two substrings are invoked independently of each other (i.e., the crossovers acting on the two substrings are not necessarily performed together during each crossover event).
Since the node priority field encodes topologically sorted permutations of tasks with precedence constraints, a crossover

operator for this substring that preserves precedence relationships has been developed; i.e., the permutation of tasks in each offspring also preserves the precedence constraints imposed by the data-flow graph. This ensures that every offspring created during crossover results in a valid schedule. This crossover operator is similar to a one-point crossover and can be described as follows. 1) One-Point Topological Crossover: A crossover operator that preserves the topological constraints among the tasks in the node-priority field is described in Fig. 7. In this description, parent-1 and parent-2 refer to the two parents, while offspring-1 and offspring-2 refer to the two offspring created by the crossover operator. The crossover is illustrated with reference to the example data-flow graph shown in Fig. 6. In applying this operator, the parent chromosomes are divided randomly into two sections. For the first section, the offspring inherits both the positions and the order of the tasks from one parent. For the second section, it inherits the order of the remaining tasks from the other parent. Since the precedence relationships are maintained in both parents, they are also maintained in the first section of the offspring. Since the tasks in the second section of the offspring all follow the tasks of the first section, the order of the tasks inherited from the second parent also maintains all precedence relationships among the tasks specified in the input data-flow graph. Using the example dataflow graph from Fig. 6, applying this operator to two topologically sorted parents parent-1 and parent-2 yields offspring-1 and offspring-2, each of which is again a topologically sorted task list.
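Following this description, a minimal sketch of the one-point topological crossover (function names are ours): the offspring keeps the first section of one parent in place and appends the remaining tasks in the relative order they take in the other parent.

```python
import random

def one_point_topological_crossover(parent_a, parent_b, cut):
    """Offspring inherits the positions and order of parent_a's first
    `cut` tasks, then the remaining tasks in parent_b's relative order."""
    head = parent_a[:cut]
    head_set = set(head)
    tail = [t for t in parent_b if t not in head_set]
    return head + tail

def crossover(parent_a, parent_b, rng):
    cut = rng.randrange(1, len(parent_a))   # random section boundary
    return (one_point_topological_crossover(parent_a, parent_b, cut),
            one_point_topological_crossover(parent_b, parent_a, cut))
```

Because the head is a prefix of a valid topological order and the tail preserves the other parent's valid relative order, both offspring remain topologically sorted.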


2) Crossover Operation for Resource Allocation Substring: The resource-allocation substring of the chromosome encodes the number of hardware functional units of each type available for scheduling operations in each time step. Since the number of allocated functional units of each type is independent of the others, standard one-point crossover can be applied to this substring. Consider, for example, two parent substrings parent-1 = (4, 2, 3) and parent-2 = (2, 3, 1), where parent-1 represents a solution containing four adders, two multipliers, and three subtractors, while parent-2 represents a solution containing two adders, three multipliers, and one subtractor. Applying a one-point crossover at a randomly selected cut-point after the first gene creates the two offspring solutions offspring-1 = (4, 3, 1) and offspring-2 = (2, 2, 3). Here, the crossover creates one solution containing four adders, three multipliers, and one subtractor and another with two adders, two multipliers, and three subtractors.

D. Mutation Operators

As in the case of the crossover operators, the node-priority and resource-allocation substrings have their own independent mutation operators; and just as for the crossover, the mutation operators are invoked independently of each other (i.e., the mutation operators acting on the two substrings are not necessarily performed together during each mutation event). 1) Mutation Operation for Node Priority Substring: The one-point topological crossover operator described above produces valid offspring from valid parent chromosomes. However, since it preserves the topological order of tasks, it also preserves many of the subsequences that are common to both parents. This can lead to a loss of diversity in the population and hence to premature convergence. An effective mutation operator is needed to continually reintroduce diversity into the population and to enable the GA to escape from local minima.
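The resource-allocation crossover above is the textbook one-point crossover; the sketch below reproduces the worked example (the vector order adders, multipliers, subtractors is as described in the text).

```python
def one_point_crossover(a, b, cut):
    """Standard one-point crossover on resource-allocation substrings."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

# The worked example above: (adders, multipliers, subtractors).
parent_1 = [4, 2, 3]
parent_2 = [2, 3, 1]
offspring_1, offspring_2 = one_point_crossover(parent_1, parent_2, 1)
# offspring_1 == [4, 3, 1]; offspring_2 == [2, 2, 3]
```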
In our GA, we used a variation of the shift mutation for permutation representations, modified to preserve all the precedence relationships in a string. Consider a string s containing a topologically sorted permuted list of tasks, and let p be the position of some task t in the string. Since the precedence relations specified in the input dataflow graph are preserved in the string, all tasks appearing in positions before p correspond to tasks that either precede t in the input dataflow graph or have no precedence constraints with t. Likewise, all tasks appearing in positions after p correspond to tasks that either succeed t in the input dataflow graph or can be scheduled independently of t. Note that each of the tasks in the input dataflow graph has a maximum of two predecessor tasks (since all data-flow graph

Fig. 8. Precedence preserving shift mutation.

operations considered have a cardinality of two) but may possibly have many successor tasks. Let PMAX denote the maximum position of a predecessor task of t in the string s, and let SMIN denote the minimum position of a successor task of t in the string s. Then task t can be moved anywhere in string s between positions PMAX and SMIN without violating any precedence constraints. The mutation operator, called the precedence-preserving shift mutation, is designed to preserve the precedence constraints specified in the node priority field of a chromosome while perturbing the priority list to create a new solution. This is done by randomly shifting the position of a task in the node priority field of the chromosome between the task's corresponding PMAX and SMIN positions. The precedence-preserving shift mutation operation is illustrated in Fig. 8. Using the example dataflow graph from Fig. 6, consider

a string in which the task picked for shifting is task 8. For this task, the predecessor task in the DFG is task 3 and the successor task is task 9; therefore, PMAX and SMIN are the positions of tasks 3 and 9 in the string. Task 8 can be shifted to any position between these two positions without violating any precedence constraints in the dataflow graph, and after a random shift within this range the resulting string is again a valid topologically sorted permutation.
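The precedence-preserving shift mutation can be sketched as follows. This is a hypothetical implementation under our own data structures (preds/succs map each task to its DFG predecessors/successors), not the paper's code.

```python
import random

def shift_mutation(perm, preds, succs, rng):
    """Move one randomly chosen task to a random position strictly between
    the positions of its last predecessor (PMAX) and first successor (SMIN)."""
    i = rng.randrange(len(perm))
    task = perm[i]
    pos = {t: k for k, t in enumerate(perm)}
    pmax = max((pos[p] for p in preds[task]), default=-1)
    smin = min((pos[s] for s in succs[task]), default=len(perm))
    j = rng.randrange(pmax + 1, smin)          # legal target slot
    out = perm[:i] + perm[i + 1:]              # remove the task ...
    out.insert(j if j <= i else j - 1, task)   # ... reinsert, adjusting the index
    return out
```

Since the task itself sits between PMAX and SMIN, the legal range is never empty, and every mutated string remains a valid topological order.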

2) Mutation Operation for Resource Allocation Substring: This mutation operator provides a means to perturb the number of functional units allocated in a solution. Perturbing the allocated amounts of many different resource types in the same mutation operation might result in high disruption of fit schemata present in a substring. To minimize this, only one randomly chosen resource is altered in any mutation event. The mutation operation for the resource allocation field is implemented as shown in Fig. 9. In Fig. 9, R represents the resource allocation substring of the chromosome, where R_i, for each functional resource of type i, represents the number of units of that resource available.

E. Fitness Evaluation

The objective of our GA is to find solutions to a high-level synthesis problem that minimize the schedule length (i.e., latency) and the number of functional and storage units required to implement the schedule, while meeting user-specified constraints (in terms of latency and overall area). An indirect representation (a topologically sorted permuted list of tasks) is used in the chromosomes for the schedules to avoid producing infeasible schedules during crossover and mutation. A list-scheduling heuristic is used to decode the permuted list of tasks into


Fig. 9. Resource allocation substring mutation.


Fig. 10. Overview of fitness evaluation procedure.

valid schedules. To search the space of functional module allocations, the resource allocation field of the chromosome is used by the list scheduler while decoding. The number of registers that need to be allocated can be computed after all the tasks in the input DFG have been scheduled. The Left Edge algorithm [29] is used as a heuristic to determine the storage requirements of a given schedule. Fig. 10 gives an overview of the fitness evaluation procedure. 1) Modified List-Scheduling Algorithm: The modified list-scheduling algorithm used in our GA takes as input both substrings of the chromosome (i.e., the node priority field and the resource allocation field) and creates a valid schedule of tasks. The operation of the list scheduler is similar to a resource-constrained list scheduler [2]. However, in our GA, the functional unit resource constraints are specified by the resource allocation substring of each chromosome, which can differ among the chromosomes in a GA population. The basic idea is to use the order of tasks in the node priority substring of the chromosome as a priority list for the list scheduler and to use this order as the selection criterion when operations compete for resources. Since the GA population contains many unique node priority strings, this enables the GA to search the space of priority functions for a list scheduler. Fig. 11 illustrates the modified list-scheduling algorithm used on a node priority substring containing the tasks to be scheduled in a given dataflow graph. Using the example dataflow graph from Fig. 6, consider a chromosome whose node priority and resource allocation fields specify the task ordering and the available functional units.

In this example, the resource allocation field indicates that the datapath has two multipliers and two adders. For this node priority list, the dataflow graph operations are scheduled as shown in Table II. From Table II, it can be seen that for this example the schedule length is six control steps. With a clock cycle time of 10 ns for each control step, the latency represented by this schedule is 60 ns. 2) Left Edge Algorithm: In this paper, as is common in high-level synthesis, all input data are assumed to be latched in registers. Similarly, all data transfers between control steps (which represent intermediate variables in the input algorithmic description) are stored in registers. Finally, all output data are also saved in registers. The number of registers required in a datapath implementation is determined by the maximum number of concurrent data transfers between any two control steps, which in turn is decided by the operation schedule [1], [2]. Hence, the number of registers needed in a datapath can be determined only after operation scheduling is done. The birth time of a data transfer is the control step in a schedule when it is created by a producer. Likewise, the end time of a data transfer is the control step when it is consumed by its last consumer. The left-edge algorithm can be described as shown in Fig. 12, where N is the total number of data transfers (including inputs and outputs) in the scheduled DFG and R represents the number of registers allocated. 3) Objective Function: The objective function used in our system is divided into two subfunctions. Each subfunction evaluates the success of the solution in the minimization of one of the two objectives of the high-level synthesis problem. The individual fitness scores for each evaluation function are combined in a weighted function that reflects the required objectives of the optimization process.
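The decode step of fitness evaluation, a resource-constrained list scheduler driven by the two chromosome fields, followed by a register count from value lifetimes, can be sketched as below. The data structures (preds, op_type, alloc) and the single-cycle-operation assumption are ours, and the left-edge allocation is condensed to the equivalent maximum-overlap register count.

```python
def list_schedule(priority, preds, op_type, alloc):
    """Bind ready operations to free FUs in node-priority order.
    Returns {task: control step}, assuming single-cycle operations."""
    step_of, step = {}, 0
    while len(step_of) < len(priority):
        free = dict(alloc)                       # FUs free in this control step
        for t in priority:                       # priority order breaks ties
            if t in step_of:
                continue
            ready = all(p in step_of and step_of[p] < step for p in preds[t])
            if ready and free.get(op_type[t], 0) > 0:
                step_of[t] = step
                free[op_type[t]] -= 1
        step += 1
    return step_of

def register_count(step_of, preds):
    """Registers needed = maximum number of data transfers live across any
    control-step boundary (the bound the left-edge allocation achieves)."""
    consumers = {}
    for t, ps in preds.items():
        for p in ps:
            consumers.setdefault(p, []).append(t)
    lifetimes = [(step_of[p], max(step_of[c] for c in cs))   # (birth, end)
                 for p, cs in consumers.items()]
    last = max(step_of.values())
    return max((sum(1 for b, e in lifetimes if b < s <= e)
                for s in range(1, last + 1)), default=0)
```

Tightening the resource allocation field directly lengthens the decoded schedule: with one multiplier instead of two, the two independent multiplications below serialize and the schedule grows by one control step.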
The two subfunctions used in our system correspond to the schedule length and the respective area contributions of the functional units and registers. The objective function can be written as

COST(i) = w1 * (L(i) / Lmax) + w2 * ((Afu(i) + Areg(i)) / Amax)   (1)

with

w1 + w2 = 1   (2)

where COST(i) is the cost function of the solution represented by chromosome i, and w1 and w2 represent the weights assigned to the contributions of the latency and resource terms, respectively;


Fig. 11. Modified list-scheduling algorithm.

Fig. 12. Left Edge algorithm.

TABLE III EXPERIMENT AND GA PARAMETERS

TABLE II EXAMPLE SCHEDULE BY MODIFIED LIST SCHEDULE HEURISTIC

L(i): schedule length of the solution represented by the chromosome; Lmax: longest schedule among all chromosomes in the current generation; Amax: maximum datapath area among all chromosomes in the current generation; Afu(i): sum of the areas of the FUs in the datapath represented by the chromosome; Areg(i): sum of the areas of the registers in the datapath represented by the chromosome. Normalizing the latency and area terms as shown in the COST equation keeps their values within the range [0, 1]. The GA allows a user to specify lower and upper bounds on the resource and time constraints that a datapath design must meet. Our system uses a graded penalty function to reduce the search space and to quickly guide the GA to those areas of the search space that are most likely to contain solutions meeting the

user-specified design constraints. This graded penalty function reduces the fitness of a solution in proportion to the amount by which the solution fails to meet a constraint. This feature is particularly useful in design space exploration. With the graded penalty function, the new cost function can be written as

COSTgp(i) = COST(i) + Ptime(i) + Parea(i)   (3)

in which Ptime(i) and Parea(i) are graded penalties proportional to the amounts by which the solution violates the user-specified time and area constraints

Here COSTgp represents the graded penalty cost; the meanings of the other quantities remain the same as before. The weights


w1 and w2 are user specified and reflect user preferences about the tradeoff between area and latency in the desired solution. 4) Population Size and Stopping Criteria: Much experimentation was carried out to determine an appropriate population size and stopping criterion. Our study showed that for the high-level synthesis problem, a constant population size and a constant number of generations give good average results. However, this does not take into account the size of the problem instance. Making both the population size and the number of generations proportional to the problem size means either poor results for the smaller problem instances, if the constants of proportionality are made small, or inordinately large processing times for the larger instances otherwise. A good compromise proved to be setting the population size proportional to the size of the dataflow graph and stopping after a maximum of 100 generations. These choices led, on average, to the best results and the highest degree of robustness.
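Under our reading of the cost equations, the graded-penalty objective can be sketched as below; the exact form of the penalty terms (violation amounts normalized like the base terms) is our assumption, not taken from the paper.

```python
def cost(latency, area, l_max, a_max, w1, w2,
         time_constraint=None, area_constraint=None):
    """Graded-penalty cost: normalized weighted sum of latency and area,
    plus penalties proportional to constraint violations (assumed form)."""
    base = w1 * (latency / l_max) + w2 * (area / a_max)   # base cost term
    p_time = (max(0.0, latency - time_constraint) / l_max
              if time_constraint is not None else 0.0)
    p_area = (max(0.0, area - area_constraint) / a_max
              if area_constraint is not None else 0.0)
    return base + p_time + p_area                         # graded penalty added
```

For example, with w1 = w2 = 0.5, a solution of latency 6 (of a worst case 10) and area 500 (of a worst case 1000) scores 0.55; a time constraint of 4 steps adds a graded penalty of 0.2, degrading its fitness relative to solutions that meet the constraint.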

IV. EXPERIMENTAL RESULTS

The proposed GA-based high-level synthesis system has been implemented in the C language on a SUN UltraSPARC 2. To perform a qualitative assessment of our GA, it was tested on a number of DSP benchmarks drawn from the high-level synthesis literature. We used two metrics to compare the quality of the solutions found by our GA with those of other high-level synthesis systems: 1) the latency of a synthesized datapath, which represents the time required by the synthesized datapath to complete all operations specified in an input dataflow graph, and 2) the cost in terms of chip area needed to realize the synthesized datapath. We compared our results with those obtained by several well-known scheduling techniques (ASAP [2], ALAP [2], FDS [6], and SAM [39]) as well as several GA-based and non-GA-based heuristic techniques such as PSGA [24], Ly [16], Torbey [36], Heijligers [26], SALSA [14], and OASIC [11]. This section presents results from the design-space exploration of these benchmark examples using our GA-based high-level synthesis technique and a comparison of the solutions with those of other techniques. For all the benchmarks tested, the synthesized designs were assumed to operate with a clock period of 20 ns. We used a 0.35-um CMOS module library, where ALUs, multipliers, registers, and multiplexers are implemented as hard macrocells (cells having fixed aspect ratios and pin locations). The ALUs have a propagation delay of 6.5 ns, and the multipliers have a propagation delay of 15 ns. We assume that the area cost and delay of a pipelined multiplier are the same as those of a nonpipelined multiplier. In all the experiments, the size of the GA population was set to 100, the crossover probability was 0.90, and the mutation probability was set to 0.20. Each GA run was stopped after 10 000 fitness evaluations. Table III summarizes the GA parameters used in our experiments.
Since GAs are stochastic algorithms, ten independent runs with different random number seeds were performed for each of the benchmark problem instances, and the best solution found by the GA in each of the ten runs was recorded.

For each of the benchmark problems, a design-space exploration was performed by setting different values for the weights (w1 and w2) corresponding to the delay and area terms of the fitness function. Assigning different values to these weights allows the fitness function to assign different priorities to the area and delay values of a design, enabling an exploration of the space of possible designs that satisfy the specified design constraints. In these experiments, we assumed that w1 + w2 = 1, and the value of w1 was changed from 0.0 to 1.0, in increments of 0.05. The best solutions found by the GA (with ten independent runs for each benchmark), for different settings of w1 and w2, are presented in Table IV. Solutions with different settings for w1 and w2 differ in the number of ALUs, multipliers, and storage registers used, and in the number of clock cycles (i.e., latency) required by the datapath to complete all computations with these allocated functional and storage units. In all our experiments, the CPU times required by the proposed GA for finding these solutions ranged from under 10 s (for the smaller benchmark examples) to under 2 min (for the larger benchmarks). Our GA-based synthesis system was tested on seven benchmark examples drawn from the literature [5], [6], [11], [13], [14], [16]. These benchmarks represent varying problem sizes and solution complexities and provide a representative sample of test problems used to study the effectiveness of our high-level synthesis system. In these experiments, the synthesized designs are assumed to work with a clock period of 20 ns, and each of the functional units (ALUs/multipliers) in the synthesized designs takes less than one clock cycle to complete an operation. Table IV summarizes the synthesis results for these benchmarks, for a number of circuit latencies and chip areas, representing the best solutions found by the GA for different settings of w1 and w2.
To assess the effectiveness of our GA-based technique and its ability to converge to near-optimal solutions, ten independent runs of the GA were performed with different random seeds, and in each of these runs the best solution found was recorded. Table IV shows the overall area of the best solution found by our GA-based synthesis system in these ten runs, for various performance (latency) constraints. The average area of the best solutions found by the GA over these ten runs is also shown in the table. Ideally, an effective GA-based search strategy should converge to the optimal or near-optimal solution in every GA run. Computing the average of the best solutions found over the ten independent runs provides an indication of the robustness of a GA-based search strategy and its ability to converge to the high-fitness areas of the search space. In five of the seven benchmarks tested, the GA found the optimal solution in all ten runs. For the remaining two (largest) benchmarks, the difference in fitness among the best solutions found by our GA over the ten runs is less than 3.0%. This illustrates that our GA is able to consistently converge to high-quality solutions in every GA run, for all the benchmarks tested. Assigning suitable values to w1 and w2 allows the GA to explore different areas of the area-delay tradeoff curves and to find solutions that meet a variety of design constraints. Thus, for example, setting w1 to a high value such as 0.90 causes the GA to assign a higher fitness measure to solutions with smaller delays (and a correspondingly larger overall area). Conversely, a low value for w1 (such as 0.10) results in a


TABLE IV GA RESULTS ON DSP BENCHMARKS (NOTE: ALU DELAY = 6.5 ns, MULTIPLIER DELAY = 15 ns, CLOCK PERIOD = 20 ns)

higher fitness measure being assigned to solutions that have a smaller overall area (and slower speeds). These results clearly demonstrate that faster implementations of a design have larger areas, since they require more functional/storage modules, and that minimizing the overall area results in designs that are slower. Table IV illustrates the typical trend observed in the area-delay tradeoffs of most VLSI designs. High-throughput datapath designs normally require more functional/storage modules and consequently result in circuits with larger areas. In contrast, designs resulting in smaller circuit areas usually have larger latencies. While this trend is broadly consistent for all the benchmarks, the actual shape of the area-delay curves differs considerably across examples. In some benchmarks, a small increase in area yields a substantial reduction in circuit delay, whereas in others the gain is much smaller. This illustrates that an effective design-space exploration can pay off significantly in terms of final implementation cost. To further assess the performance of our GA-based high-level synthesis system, we compared our results with those obtained

from four different scheduling techniques commonly used in high-level synthesis, namely, ASAP scheduling [2], ALAP scheduling [2], force-directed scheduling (FDS) [6], and the simultaneous scheduling, allocation, and binding (SAM) [39] technique. These scheduling algorithms were tested in a traditional high-level synthesis framework that performs the three synthesis subtasks of scheduling, allocation, and binding independently. The goal of this comparison was twofold: 1) to verify the performance gains from concurrently performing scheduling and allocation over a traditional synthesis flow that carries out these subtasks independently and 2) to use the performance of these scheduling techniques as a baseline against which to compare our results. We used the same benchmarks for comparing the results. Tables V and VI tabulate the results from our experiments comparing the chip areas of the datapaths found by our GA with those synthesized using the other four scheduling algorithms. In Table V, the chip areas in bold indicate the smallest designs for each latency value. From the table, it can be seen that our GA finds better solutions than the other four scheduling techniques, for all the benchmarks tested. Table VI shows the improvement of


TABLE V COMPARISON OF OUR GA-BASED METHOD WITH OTHER SCHEDULING ALGORITHMS, PART A (NOTE: ALU DELAY = 6.5 ns, MULTIPLIER DELAY = 15 ns, CLOCK PERIOD = 20 ns)

V. ANALYSIS AND DISCUSSION

The search space of the high-level synthesis problem is large, complex, and highly multimodal, even for reasonably sized problems. There exists a many-to-one mapping between the high-level synthesis subtasks of operation scheduling and resource allocation: every module allocation maps to many possible operation schedules. Similarly, for every set of allocated modules, there exists a many-to-one mapping between the scheduled operations and the modules to which they can be bound. This is further compounded by the interdependence between these two subtasks of high-level synthesis. Due to the nature of its search space, greedy search heuristics do not perform well in high-level synthesis. Scheduling algorithms such as ASAP and ALAP often result in suboptimal solutions, as seen from the synthesis results in Table IV. Other heuristics such as FDS and SAM, which schedule one operation at a time, have a tendency to become trapped in local optima. Techniques that use a more global approach, such as simulated annealing and GAs, tend to perform better than FDS and SAM on the high-level synthesis problem. The interdependence of the high-level synthesis subtasks makes it imperative to perform operation scheduling and resource allocation simultaneously to enable finding high-quality solutions. Several high-level synthesis systems such as SAM [39], OASIC [11], SALSA [14], Ly [16], PSGA [24], and Torbey [36] perform the synthesis subtasks simultaneously. Some of the earlier work on high-level synthesis [6]-[8], [10] performed these subtasks independently, which may result in suboptimal solutions. SAM schedules and allocates dataflow graph operations one at a time. Due to this lack of a global approach to scheduling and allocation, the algorithm can become trapped in local optima. On the benchmarks tested in our experiments, the other approaches found better solutions than SAM.
OASIC is an integer linear programming-based approach to high-level synthesis. It finds optimal solutions, but its worst-case runtimes are exponential in the size of the problem. Hence, this technique is not suited to problems of practical size. Simulated annealing always works on a single solution and traverses the search space by applying local transformations to the solution representation. The effectiveness of its search and its ability to escape from local minima depend, to a large extent, on the effectiveness of the transformations applied to a search problem. In addition, simulated annealing-based techniques need long run times to converge to a solution. Our GA-based technique found better solutions than SALSA [14], a simulated annealing-based approach, because of its more powerful search operators and its population-based search technique. Our technique was also compared with three other GA-based high-level synthesis approaches: PSGA [24], that of Torbey et al. [36], and that of Heijligers et al. [26]. The quality of the solutions found by our technique was better than that of the other three due to a better chromosome representation and better GA search operators. In PSGA, the chromosomes use a problem-specific representation where each of the dataflow graph operations

TABLE VI COMPARISON OF OUR GA-BASED METHOD WITH OTHER SCHEDULING ALGORITHMS, PART B (NOTE: ALU DELAY = 6.5 ns, MULTIPLIER DELAY = 15 ns, CLOCK PERIOD = 20 ns)

the GA-based solutions over those of the other four scheduling techniques. The average improvements range from 4.03% (compared to the FDS method) to 43.52% (compared to the SAM method). Table VII compares our GA-based technique with others from the high-level synthesis literature. Our benchmark results are compared with those reported by PSGA [24], Ly et al. [16], Torbey et al. [36], Heijligers et al. [26], SALSA [14], and OASIC [11]. Of these, PSGA, Torbey et al., and Heijligers et al. are techniques that apply GA-based methods to the high-level synthesis problem. SALSA uses a simulated annealing-based approach for high-level synthesis, while Ly et al. use a simulated evolution-based approach. Finally, OASIC addresses the high-level synthesis problem using an integer linear programming-based method. In all these approaches, the synthesized datapaths assume that ALUs take one clock cycle to complete an operation, while multipliers take two clock cycles to complete an operation. In all the experiments done for Table VII, a clock period of 10 ns is assumed so that, with the ALUs and multipliers chosen from our module library, all ALU operations can be done in one clock cycle and all multiplications can be done in two clock cycles. Table VII compares solutions that use multicycled as well as pipelined multipliers. The best solutions (i.e., designs with the smallest chip areas for a given latency) are printed in bold. As can be seen from Table VII, for the instances tested, our technique finds solutions that are better than those found by the other techniques, for the specified design latencies for these benchmarks.


TABLE VII COMPARISON OF OUR GA-BASED METHOD WITH OTHER HEURISTIC HIGH-LEVEL SYNTHESIS TECHNIQUES (NOTE: MULTICYCLED AND PIPELINED MULTIPLIERS USED AS MARKED, CLOCK PERIOD = 10 ns)

is assigned a priority (called work remaining) based on the length of the longest path from the dataflow graph node to an output. Integers are used to encode these priorities. The weakness of this chromosome representation is that, in the worst case, an exponentially large number of genotype representations map to the same solution. This increases the size of the search space, slowing down the GA search. The GA-based high-level synthesis technique proposed in [26] has a similar weakness. In [26], the authors represent chromosomes as random permutations of dataflow graph operations and use standard permutation-based crossover operators to search for good schedules. However, when there are dependencies among operations in a dataflow graph, many permutations map to the same schedule, leading to a slower GA search. The technique of [36] uses a binary encoding for the chromosomes. However, with this encoding the size of the chromosome increases exponentially with the size of the problem, with a corresponding increase in the size of the search space. To reduce the size of the search space, we use a permutation representation for the dataflow graph operations, where the operations in the chromosome are topologically sorted based on the operator dependencies present in the input dataflow graph. All the chromosomes in the population maintain this property at all times, and the GA search operators are designed to preserve operator precedences when creating new individuals. This alleviates the many-to-one relationship that exists between the genotype and phenotype encodings when a naive permutation representation is used. Our experiments confirm that this representation leads to fast convergence of the GA to good solutions. In addition, to maintain population diversity throughout the search process and to avoid premature convergence, we disallow duplicate chromosomes in the population.
Results from the benchmarks tested indicate that this is an effective strategy for improving GA performance. A natural extension of our work is to integrate it into an evolvable hardware-based HLS framework [40]-[42]. In this framework, the evolved circuit is implemented in FPGA hardware. The population of solutions is maintained outside the circuit, and the evolutionary mechanisms of selection, crossover, mutation, and fitness computation are executed outside the resulting circuit as well. However, the fitness of the solutions is evaluated by programming them into the FPGA hardware, where the performance characteristics of the implemented circuit are determined in terms of measurable metrics such as silicon area, operation speed, and dissipated power. This GA-based evolvable framework could use the chromosome representation proposed in our paper to encode solutions in the GA population. The datapath circuits are synthesized offline using higher level functions (adders, multipliers, multiplexers, registers). The use of higher level functional modules reduces the search space by orders of magnitude when compared to using a gate-level (i.e., logic cell level) representation. In addition, it has the benefit of reducing the size of the genome used to represent solutions. During fitness evaluation, using the decoder proposed in our paper, each chromosome encoding in the GA population is decoded into a circuit netlist comprising an interconnection of such higher level functional modules. Since these functional modules are drawn from a predesigned module library (represented as an interconnection of logic cells on the FPGA architecture), the decoded circuit is mapped to a netlist of logic cells implementing the desired circuit and loaded onto the FPGA circuit board for evaluation. The fitness function suggested in our paper can be modified to incorporate hardware costs associated with the FPGA implementation substrate (gate count, routability, and number of FPGAs needed) as well as the actual operating speed (maximum clock rates) achievable with the resulting implementation.

VI. SUMMARY AND CONCLUSION

Efficient design space exploration during high-level synthesis is critical to the design of cost-effective VLSI circuits. In this paper, we have presented a GA-based approach for combined scheduling and allocation for design space exploration during the high-level synthesis of datapaths. Our technique offers several advantages over traditional high-level synthesis methods.
Our approach integrates the interdependent subtasks of scheduling and allocation, as opposed to treating these subtasks independently, as is done in many traditional high-level synthesis systems. We have shown that concurrently performing scheduling and allocation is effective in obtaining high-performance datapath architectures as compared with conventional approaches that perform these subtasks sequentially. Our approach combines the global search technique of GAs with fast local search heuristics for scheduling and register allocation. This combination allows our synthesis system to search


a large, highly complex, and multimodal design space in an efficient manner, in order to find the best possible solution within reasonable CPU times. Our GA incorporates three key features that enable it to efficiently perform design space exploration in high-level synthesis. First, a multichromosome encoding is used to concurrently handle the search in the space of schedules and allocations for an input behavioral specification. A novel encoding is used to specify datapath schedules. This encoding, designed to improve the efficiency of the GA search for design space exploration, uses a chromosome representation that encodes the precedence relationships among the tasks in the input behavioral specification with a topological order-based representation to specify schedule priorities. This alleviates the many-to-one relationship that exists between the genotype and phenotype encoding in other GA-based high-level synthesis systems. This chromosome representation enables the GA to prune the search space, increasing search efficiency and leading to significantly better solutions than those obtained by other GA-based high-level synthesis systems. Second, our GA uses an efficient list-scheduling heuristic to decode chromosomes into valid schedules. The list-scheduling heuristic is not limited to using a single fixed priority function, as in other list-scheduling approaches, but instead uses adaptive task priorities that are dynamically decided by the GA. This enhances the robustness of our GA, enabling the scheduler to avoid local optima. Third, our GA concurrently performs register minimization in the datapath, in addition to functional allocation. A variant of the Left Edge algorithm is used to determine the register storage requirements for each schedule. Since the cost of registers, in terms of chip area, can be significant, our GA-based framework minimizes the number of registers needed in a synthesized datapath.
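The second and third features above can be sketched as follows. This is a minimal illustration under simplifying assumptions (single-cycle operations, our own data structures, not the authors' code): a resource-constrained list scheduler whose task priorities come from the chromosome's topological order, and a left-edge-style sweep that counts the minimum number of registers a schedule's value lifetimes require.

```python
import heapq

def list_schedule(order, deps, op_type, n_units):
    """order: chromosome (topologically sorted ops); deps: op -> set of
    predecessors; op_type: op -> functional-unit type; n_units: type ->
    available units. Returns op -> control step (single-cycle ops)."""
    priority = {op: i for i, op in enumerate(order)}  # earlier = higher priority
    remaining, done, sched = set(order), set(), {}
    step = 0
    while remaining:
        # ops whose predecessors have all finished, best priority first
        ready = sorted((op for op in remaining if deps.get(op, set()) <= done),
                       key=priority.__getitem__)
        used, scheduled = {}, []
        for op in ready:
            t = op_type[op]
            if used.get(t, 0) < n_units[t]:  # bind if a unit of this type is free
                used[t] = used.get(t, 0) + 1
                sched[op] = step
                scheduled.append(op)
        done.update(scheduled)
        remaining.difference_update(scheduled)
        step += 1
    return sched

def left_edge_register_count(lifetimes):
    """lifetimes: (birth, death) control steps per value. Left-edge packing
    assigns non-overlapping lifetimes to the same track, so the register
    count equals the maximum number of simultaneously live values."""
    heap = []  # death steps of values currently occupying a register
    max_regs = 0
    for birth, death in sorted(lifetimes):
        while heap and heap[0] <= birth:  # register freed before this birth
            heapq.heappop(heap)
        heapq.heappush(heap, death)
        max_regs = max(max_regs, len(heap))
    return max_regs
```

Because the priorities come straight from the chromosome, the GA can steer the scheduler away from the local optima a single fixed priority function would fall into, while the register count feeds directly into the area term of the fitness function.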

REFERENCES
[1] D. Gajski, N. Dutt, A. Wu, and S. Lin, High Level Synthesis: Introduction to Chip and System Design. Norwell, MA: Kluwer, 1992. [2] G. De Micheli, Synthesis and Optimization of Digital Circuits. New York: McGraw-Hill, 1994. [3] D. Thomas, E. Lagnese, R. Walker, J. Nestor, J. Rajan, and R. Blackburn, Algorithmic and Register-Transfer Level Synthesis: The System Architect's Workbench. Norwell, MA: Kluwer, 1990. [4] M. McFarland and T. Kowalski, "Incorporating bottom-up design into high-level synthesis," IEEE Trans. Comput.-Aided Des., vol. 8, no. 9, pp. 928-950, Sep. 1990. [5] B. M. Pangrle and D. D. Gajski, "Slicer: A state synthesizer for intelligent silicon compilation," in Proc. Int. Conf. Computer-Aided Des., 1987, pp. 536-541. [6] P. G. Paulin and J. P. Knight, "Force-directed scheduling for the behavioral synthesis of ASICs," IEEE Trans. Comput.-Aided Des., vol. 8, no. 6, pp. 661-679, 1989. [7] A. C. Parker, J. T. Pizarro, and M. Mlinar, "Maha: A program for datapath synthesis," in Proc. 23rd ACM/IEEE Design Automation Conf., 1986, pp. 461-466. [8] R. Camposano, "Path-based scheduling for synthesis," IEEE Trans. Comput.-Aided Des., vol. 10, pp. 85-93, 1991. [9] R. K. Brayton, R. Camposano, G. De Micheli, R. Otten, and J. van Eijndhoven, "The Yorktown silicon compiler system," in Silicon Compilation, D. D. Gajski, Ed. Reading, MA: Addison-Wesley, 1988, pp. 204-310. [10] Z. Peng, "Synthesis of VLSI systems with the CAMAD design aid," in Proc. 23rd ACM/IEEE Design Automation Conf., 1986, pp. 278-284.

[11] C. H. Gebotys and M. I. Elmasry, "Global optimization approach for architectural synthesis," IEEE Trans. Comput.-Aided Des., vol. 12, pp. 1266-1278, 1993. [12] C. T. Hwang, J. H. Lee, Y. C. Hsu, and Y. L. Lin, "A formal approach to the scheduling problem in high-level synthesis," IEEE Trans. Comput.-Aided Des., vol. 10, no. 2, pp. 464-475, Feb. 1991. [13] S. Devadas and A. R. Newton, "Algorithms for hardware allocation in data path synthesis," IEEE Trans. Comput.-Aided Des., vol. 8, pp. 768-781, 1989. [14] J. A. Nestor and G. Krishnamoorthy, "SALSA: A new approach to scheduling with timing constraints," IEEE Trans. Comput.-Aided Des., vol. 12, pp. 1107-1122, 1993. [15] G. Krishnamoorthy and J. A. Nestor, "Data path allocation using extended binding model," in Proc. 32nd ACM/IEEE Design Automation Conf., 1992, pp. 279-284. [16] T. A. Ly and J. T. Mowchenko, "Applying simulated evolution to high level synthesis," IEEE Trans. Comput.-Aided Des., vol. 12, no. 2, pp. 389-409, Feb. 1993. [17] E. Torbey and J. Knight, "High-level synthesis of digital circuits using genetic algorithms," in Proc. Int. Conf. Evol. Comput., May 1998, pp. 224-229. [18] E. F. Girczyc, R. J. Buhr, and J. P. Knight, "Application of a subset of ADA as an algorithmic hardware description language for graph-based hardware compilation," IEEE Trans. Comput.-Aided Des., vol. CAD-4, no. 1, pp. 134-142, Jan. 1985. [19] J. Vanhoof, K. V. Rompaey, I. Bolsens, G. Goossens, and H. D. Man, High-Level Synthesis for Real-Time Digital Signal Processing. Norwell, MA: Kluwer, 1993. [20] R. Camposano and W. Rosenstiel, "Synthesizing circuits from behavioral descriptions," IEEE Trans. Comput.-Aided Des., vol. 8, pp. 171-180, 1989. [21] T. L. Adam, K. M. Chandy, and J. R. Dickson, "A comparison of list schedules for parallel processing systems," Commun. ACM, vol. 17, pp. 685-690, 1974. [22] T. Blickle and L. Thiele, "A mathematical analysis of tournament selection," in Proc. 6th Int. Conf. Genetic Algorithms, 1995, pp. 9-16. [23] G. Syswerda, "Uniform crossover in genetic algorithms," in Proc. 3rd Int. Conf. Genetic Algorithms, 1989, pp. 2-9. [24] M. K. Dhodhi, F. H. Hielscher, R. H. Storer, and J. Bhasker, "Datapath synthesis using a problem-space genetic algorithm," IEEE Trans. Comput.-Aided Des., vol. 14, pp. 934-944, 1995. [25] N. Wehn et al., "A novel scheduling and allocation approach to datapath synthesis based on genetic paradigms," in Proc. IFIP Working Conf. Logic Architecture Synthesis, 1991, pp. 47-56. [26] M. J. M. Heijligers, L. J. M. Cluitmans, and J. A. G. Jess, "High-level synthesis scheduling and allocation using genetic algorithms," in Proc. Asia South Pacific Design Automation Conf., 1995, pp. 61-66. [27] C. Mandal, P. P. Chakrabarti, and S. Ghose, "GABIND: A GA approach to allocation and binding for the high-level synthesis of data paths," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 8, no. 5, pp. 747-750, Oct. 2000. [28] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, 3rd ed. New York: Springer-Verlag, 1996. [29] F. J. Kurdahi and A. C. Parker, "REAL: A program for register allocation," in Proc. 24th ACM/IEEE Design Automation Conf., 1987, pp. 210-215. [30] R. M. San and J. P. Knight, "Genetic algorithms for optimization of integrated circuit synthesis," in Proc. 5th Int. Conf. Genetic Algorithms, San Mateo, CA, 1993, pp. 432-438. [31] M. Nourani and C. A. Papachristou, "Stability-based algorithms for high-level synthesis of digital ASICs," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 8, no. 4, pp. 431-435, Aug. 2000. [32] J. H. Lee, Y. C. Hsu, and Y. L. Lin, "A new integer linear programming formulation for the scheduling problem in datapath synthesis," in Proc. ICCAD-89, pp. 20-23. [33] N. Park and A. C. Parker, "Sehwa: A software package for synthesis of pipelines from behavioral specifications," IEEE Trans. Comput.-Aided Des., vol. 7, pp. 356-370. [34] S. Y. Kung, "On supercomputing with systolic/wavefront array processors," Proc. IEEE, vol. 72, no. 7, pp. 867-884, Jul. 1984. [35] D. J. Mallon and P. B. Denyer, "A new approach to pipeline optimization," in Proc. Eur. Conf. Design Automation, 1990, pp. 83-88. [36] E. Torbey and J. Knight, "Performing scheduling and storage optimization simultaneously using genetic algorithms," in Proc. IEEE Midwest Symp. Circuits Systems, 1998, pp. 284-287.


[37] M. Nourani and C. A. Papachristou, "Stability-based algorithms for high-level synthesis of digital ASICs," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 8, no. 4, pp. 431-435, Aug. 2000. [38] W. F. J. Verhaegh, P. E. R. Lippens, E. H. L. Aarts, J. H. M. Korst, A. van der Werf, and J. L. van Meerbergen, "Efficiency improvements for force-directed scheduling," in Proc. IEEE Int. Conf. Computer-Aided Des., 1992. [39] R. J. Cloutier and D. E. Thomas, "The combination of scheduling, allocation and mapping in a single algorithm," in Proc. 27th Design Automation Conf., Jun. 1990, pp. 71-76. [40] M. Sipper, E. Sanchez, D. Mange, M. Tomassini, A. Pérez-Uribe, and A. Stauffer, "A phylogenetic, ontogenetic, and epigenetic view of bio-inspired hardware systems," IEEE Trans. Evol. Comput., vol. 1, no. 2, pp. 83-97, Apr. 1997. [41] X. Yao and T. Higuchi, "Promises and challenges of evolvable hardware," IEEE Trans. Syst., Man, Cybern. C, vol. 29, no. 1, pp. 87-97, Feb. 1999. [42] J. C. Gallagher, S. Vigraham, and G. Kramer, "A family of compact genetic algorithms for intrinsic evolvable hardware," IEEE Trans. Evol. Comput., vol. 8, no. 2, pp. 11-26, Apr. 2004. [43] G. Ascia, V. Catania, and M. Palesi, "A GA-based design space exploration framework for parameterized system-on-a-chip platform," IEEE Trans. Evol. Comput., vol. 8, no. 4, pp. 329-346, Aug. 2004. [44] D. Jackson, "Evolution of processor microcode," IEEE Trans. Evol. Comput., vol. 9, no. 1, pp. 44-54, Feb. 2005.

Vyas Krishnan received the M.S. degrees in computer science and in physics from the University of South Florida, Tampa. He is currently working towards the Ph.D. degree in computer science and engineering at the University of South Florida. His current research interests are in the areas of evolutionary computation, multiobjective optimization, development of CAD algorithms for deep-submicron VLSI design, high-level synthesis, and techniques for low-power VLSI design.

Srinivas Katkoori (S'92-M'97-SM'03) received the B.E. degree in electronics and communication engineering from Osmania University, Hyderabad, India, in 1992 and the Ph.D. degree in computer engineering from the Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati (UC), Cincinnati, OH, in 1998. At UC, his doctoral dissertation was recognized with an Outstanding Dissertation, Honorable Mention. He is an Associate Professor in the Computer Science and Engineering Department, University of South Florida, Tampa. He joined the department in 1997 as an Assistant Professor and was promoted with tenure in Fall 2004. From 1999 to 2005, he served as the Faculty Advisor for the IEEE Computer Society Student Chapter of the USF CSE Department. He holds one U.S. patent (No. 6,963,217) entitled "Method and Apparatus for Creating Circuit Redundancy in Programmable Logic Devices." To date, he has published 45 research publications in peer-reviewed journals and conferences. His research interests are in the general area of VLSI CAD and, more specifically, in high-level synthesis, power estimation and optimization, reconfigurable computing, and radiation-tolerant design and CAD. Dr. Katkoori is the recipient of the 2001 Faculty Early Career Development Award from the National Science Foundation. His teaching has been recognized with the Florida Outstanding Engineering Educator Award by the IEEE Florida Council (Region 3). He is a Member of the Association for Computing Machinery (ACM). From June 2003 to September 2005, he served as the Editor-in-Chief for the Suncoast Signal, the newsletter of the IEEE Florida West Coast Section. In 2006, he received the National-Level IEEE-USA Professional Achievement Award.
