A multiway partitioning algorithm for parallel gate level Verilog simulation
Lijun Li and Carl Tropper
School of Computer Science, McGill University, Montreal, Canada
lli22, carl@cs.mcgill.ca

Abstract

We describe, in this paper, a multiway partitioning algorithm for parallel gate level Verilog simulation. The algorithm is an extension of a multi-level algorithm which only creates two partitions. Like its predecessor, it takes advantage of the design hierarchy present in a Verilog circuit design. The information it makes use of is contained in the modules and their instances. The algorithm makes use of a hypergraph model of the Verilog design in which a vertex in the hypergraph represents a module instance. Our new algorithm relies upon a metric whose function is to balance the load and the communications between the modules of the Verilog design. Pre-simulation is used to evaluate the partitioning metric. When compared to hMetis, a well known multilevel partitioning algorithm, our algorithm produces a superior speedup and a reduced cut-size.

1 Introduction

It is well known that partitioning plays a pivotal role in parallel simulation. [4, 17] attest to its importance in logic simulation. Because partitioning is an NP complete problem, it is necessary to rely upon heuristics. These heuristics attempt to balance the often conflicting requirements of (1) minimizing the communication between the nodes involved in the parallel simulation, (2) maximizing the concurrency in the simulation, and (3) equalizing the load at the nodes.

This paper extends the partitioning algorithm described in [16], which takes advantage of the hierarchical information present in a Verilog design. Previous efforts [21, 9, 1, 18, 22, 14, 20, 11, 13] partitioned flattened netlists of the circuits, because they were originally designed for floor planning (layout) and not for simulation. Focusing on netlists often produced big cut-sizes which in turn resulted in unacceptable communication costs.
[16] takes advantage of the design hierarchy by using Verilog modules and their instances as the basic partitioning element instead of gates. The module is the basic unit of code in the Verilog language. Both behavioral and structural code can be contained in a module. The encapsulation property of the module gives designers the ability to reuse the module in a VLSI design. Moreover, the module provides an interface to the program while hiding the complexity inside of it. The algorithm produced a superior speedup and a smaller cut-size than the corresponding quantities produced by hMetis [11], a widely used multi-level algorithm, applied to a flattened netlist.

The algorithm described in [16] is only capable of producing two partitions. The algorithm described in this paper extends it to a multiway setting: it can produce a given number of partitions. In order to do this, we introduce a metric which allows a trade-off to be made between balancing the computational load and minimizing the communication. Pre-simulation is used in conjunction with this metric in order to supply the necessary data with which to evaluate the metric and the trade-off. The partition resulting in the best speedup is then selected for use in the simulation.

The rest of this paper is organized as follows. Section 2 is devoted to related research. In section 3, we present the details of our design-driven partitioning algorithm. A comparison of the cut-size and of the execution time of our design-driven partitioning algorithm and hMetis partitioning based on netlists is presented in section 4. The last section contains our conclusions and thoughts about future work.

2 Related work

Graph partitioning is an NP hard problem, and as a consequence the algorithms used for static partitioning of gate level simulations are heuristic in nature.
These heuristics have the following, possibly conflicting, goals: (1) extracting the maximum concurrency available in the simulation [11], (2) minimizing the communication between the LPs of the simulation [19], and (3) balancing the computational load among the nodes of the platform. Since these goals can conflict with one another, later heuristics tend to achieve a balance between them, e.g. [22].

37th International Conference on Parallel Processing, 0190-3918/08 $25.00 2008 IEEE, DOI 10.1109/ICPP.2008.89

Iterative techniques afford the ability to improve solutions incrementally. After an initial partition, nodes are exchanged between the partitions in order to improve the quality of the partition. [12] introduces a coarsening phase in a multilevel hypergraph partitioning algorithm. During the coarsening phase, a sequence of successively smaller hypergraphs is constructed. The purpose of coarsening is to create a smaller hypergraph while preserving the partitioning quality obtained from the original hypergraph. The authors claim that hMetis produces partitions that are consistently better than those of other widely used algorithms, and that it is one to two orders of magnitude faster. [20] and [12] try to reduce the size of the hypergraph from the bottom up, i.e. they extract clusters from the flattened netlist without worsening the quality of the partitioning of the original netlist. The algorithm described in this paper makes use of iterative partitioning.

Most gate level partitioning algorithms make use of a flattened netlist [2]. In doing so, they fail to take advantage of design information, including the relationship of the modules and their instances to one another. [16] takes advantage of the hierarchical relationship of these modules in Verilog in an algorithm which can produce two partitions. The objective of the algorithm is to minimize the communication between modules while balancing the load.
Initially, the algorithm makes use of cone partitioning [19]. This is followed by an iterative phase in which nodes are exchanged between partitions in order to improve the quality of the partitioning. If necessary, a further iterative phase follows in which nodes are exchanged between flattened partitions. As previously mentioned, the algorithm produced a superior speedup and a reduced cut-size compared to hMetis for a 1 million gate circuit.

3 Multi-way partitioning algorithm

The algorithm follows the general contours of the one described in [16]. The algorithm's goal is to minimize the communication between the processors, subject to a constraint on the magnitude of the load which can be placed on each processor.

We describe the algorithm in the following sections. The first section describes load balancing, the second describes how we pair partitions in order to accomplish the iterative exchange, followed by a section which contains an overview of the algorithm. The final section contains remarks on the pre-simulation which is employed in order to choose the final partition.

We make use of a hypergraph to model the circuit. In this hypergraph, there are two kinds of vertices. One is the ordinary gate, such as AND, OR, NAND, XOR. The other kind of vertex is a Verilog instance, which we can treat as a super-gate with more complex logic than the regular gates. We associate the number of gates with each vertex in the hypergraph. The introduction of super-gates reduces the number of vertices in order to make the partitioning algorithm more efficient.

3.1 Load balancing constraint

We define the load on a processor as the number of gates in the partition assigned to the processor. The definition is a simplification in that it is based on the assumption that all of the gates in the circuit are equally active during the simulation. In fact, hotspots do occur throughout the course of the simulation.
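The hypergraph model and the load definition above can be illustrated with a minimal sketch (names and data are ours, not from the paper's implementation): each vertex carries a gate count as its weight (1 for a plain gate, larger for a super-gate standing in for a module instance), a hyperedge is the set of vertices a net connects, and the load of a partition is the sum of its vertex weights.

```python
def partition_loads(weights, assignment, k):
    """Gate count per partition; weights[v] is the gate count of vertex v."""
    loads = [0] * k
    for v, part in assignment.items():
        loads[part] += weights[v]
    return loads

def hyperedge_cut(hyperedges, assignment):
    """Number of hyperedges whose vertices span more than one partition."""
    return sum(1 for edge in hyperedges
               if len({assignment[v] for v in edge}) > 1)

# Illustrative circuit: two plain gates (a, b) and two super-gates (m1, m2).
weights = {"a": 1, "b": 1, "m1": 40, "m2": 25}
assignment = {"a": 0, "b": 1, "m1": 0, "m2": 1}
hyperedges = [{"a", "m1"}, {"a", "b"}, {"m1", "m2"}]

print(partition_loads(weights, assignment, 2))  # [41, 26]
print(hyperedge_cut(hyperedges, assignment))    # 2
```

Treating a module instance as a single weighted vertex is what keeps the hypergraph small: the partitioner moves whole instances, not individual gates.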
In recognition of this, we make use of a load balancing factor b which allows for unequal loads to be placed on different processors:

    load * (1/k - b/100) <= load[i] <= load * (1/k + b/100)        (1)

In formula 1, load[i] is the number of gates in partition i and load is the total number of gates in the circuit. k represents the number of processors involved in the simulation. This load balancing constraint guarantees that the difference in the load assigned to two different processors is less than 2*b percent of the total load of the simulation. We have experimented with different values of k and b, and portray the effect of different choices of b in section 4.

3.1.1 Pairing of partitions

The recursive approach applies bipartitioning recursively until the desired number of partitions is obtained, while the direct approach partitions the circuit directly. [7, 5] are direct multiway partitioning algorithms.

Pairwise partitioning is a direct multiway partitioning algorithm. In pairwise multiway partitioning, the initial partitioning partitions the circuit into k partitions instead of just 2 partitions as in the recursive partitioning algorithm. In the next step, the algorithm chooses two partitions based on some criterion. Then swapping of circuit elements is executed recursively between the paired partitions in order to further minimize the cut-size of the pair. The pairing and the recursive moving are done iteratively until the cut-size is minimized and the load balancing constraint is achieved. Figure 1 shows the principle of the pairwise partitioning algorithm. In the initial partitioning, we can see that the circuit is partitioned into 8 partitions directly. Then, in the next step, P000 and P100 are paired together for iterative cut-size improvement, as indicated by the dotted line. Later, P001 and P011 are paired together.

There are a number of criteria which can be used to choose the pairs of partitions:

Random: The pairing of partitions is random.
It is simple and efficient, but the pairing quality is not good.

Exhaustive: The pairing considers every combination of the partitions. It is computationally complex, but produces better results because it is able to climb out of local minima.

Cut-based: The pairing is done between the two partitions between which the cut-size is maximum.

Gain-based: The pairing is done between the two partitions between which the cut-size reduction is maximum.

Figure 1. Pairwise multiway partitioning algorithm

The recursive algorithm is computationally simple and fast. However, it suffers from several limitations. If the number of partitions is not a power of 2, the desired number of multiway partitions cannot be achieved. Furthermore, as the algorithm proceeds, it becomes harder to reduce the cut-size since the partitioning is performed on finer and finer hypergraphs. This observation, combined with the assertion in [5] that their k-way pairwise partitioning algorithm produces good results efficiently, led us to choose the direct algorithm.

3.2 Flattening

In the event that super-gates are too large, the load balancing constraint may be broken. Hence, as in [16], we flatten the largest super-gate in the partition and employ iterative movement in order to achieve a better load balance.

3.3 The algorithm

Figure 2 contains a flowchart of the algorithm. A cone partitioning algorithm [19] is first employed to generate an initial partition. Cone partitioning emphasizes the concurrency present in the design. The algorithm starts at the primary inputs of the circuit and traverses the hypergraph.
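The cut-based pairing criterion listed above can be sketched as follows (a minimal illustration under our own naming, not the paper's code): compute, for every pair of partitions, the number of hyperedges that touch both, and pair the two partitions whose mutual cut is largest.

```python
from itertools import combinations

def pairwise_cut(hyperedges, assignment, p, q):
    """Hyperedges with at least one vertex in partition p and one in q."""
    count = 0
    for edge in hyperedges:
        parts = {assignment[v] for v in edge}
        if p in parts and q in parts:
            count += 1
    return count

def pick_pair_cut_based(hyperedges, assignment, k):
    """Cut-based criterion: pair the two partitions with the largest mutual cut."""
    return max(combinations(range(k), 2),
               key=lambda pq: pairwise_cut(hyperedges, assignment, *pq))

# Illustrative 3-way partition of five vertices.
assignment = {"a": 0, "b": 1, "c": 2, "d": 0, "e": 1}
hyperedges = [{"a", "b"}, {"d", "b"}, {"a", "e"}, {"c", "d"}]
print(pick_pair_cut_based(hyperedges, assignment, 3))  # (0, 1)
```

A gain-based variant would rank pairs by the cut-size reduction an iterative pass achieves rather than by the raw mutual cut, at the cost of a trial pass per pair.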
A pairing process is then executed in order to pick candidates for iterative movement. The algorithm moves free vertices iteratively between the two partitions picked by the pairing until either there is no free vertex left or no gain in the cut-size can be obtained. The algorithm then checks to see if the load meets the load balancing constraint. If the load balancing constraint is not met, the algorithm flattens the largest super-gate and performs an iterative movement on the resulting hypergraph, as in [16]. The pairing, iterative movement and flattening process are repeated until no pairing configuration is available. At the end of the partitioning algorithm, a minimal cut-size is achieved and the load balancing constraint is met.

Figure 2. Flowchart of the design-driven partitioning algorithm

3.4 Pre-simulation

We use pre-simulation [6] to evaluate the trade-off between load balance and the communication cost in order to find the best compromise. The criterion used to evaluate a circuit partition is the speedup during the pre-simulation. The partition which produces the best speedup for some choice of k and b is used in the circuit simulation.

We used both brute force pre-simulation and heuristic pre-simulation. Brute force pre-simulation runs all combinations of the parameters k and b, while heuristic pre-simulation only runs a limited set of the k and b combinations in order to reduce the time devoted to pre-simulation. Basically, the algorithm uses binary search to speed up the search for the best balance point between communication and load balance. The algorithm is described in figure 3.
- Start with the maximum number of processors allowed. (This offers the maximum possible concurrency; sooner or later, no choice of b will overcome having too many processors.)

- Allow b to vary from 7.5 to 15. (Picking b = 5 allows only a 10% total variation in the loads; in our experiments, every processor count had a speedup of less than 1 with b = 5.)

- Increase b until the speedup decreases for the first time, and halt when this happens. This is the choice of b. (Two pre-simulation runs were required for our circuit.)

    maxSpeedup = 1
    bestK = n; bestB = 7.5
    for (k = n; k >= 2; k = k - 1) {
        for (b = 7.5; b <= 15; b += 2.5) {
            speedup = presimulation(k, b);
            if (speedup > maxSpeedup) {
                maxSpeedup = speedup;
                bestK = k; bestB = b;
            } else
                break;
        }
    }
    return bestK, bestB;

Figure 3. Pseudo-code of the heuristic pre-simulation algorithm

The disadvantage of the heuristic algorithm is that it can be trapped in a local minimum, and so may fail to locate the best balance point. We will study further how to reduce the pre-simulation time without sacrificing the partitioning quality.

4 Experiments

All of our experiments were conducted on a network of 4 computers, each of which has an AMD Athlon (1 GHz) processor and 512 MB of RAM. They are connected by a 1 Gbit Ethernet network. All of the machines run the Linux operating system, and MPICH [10] is used for message passing between the processors. MPICH is a freely available, portable implementation of MPI (Message Passing Interface), a standard for message passing in distributed-memory applications used in parallel/distributed computing.

In our experiments we used a synthesized netlist for a Viterbi decoder, which has 388 modules and about 1.2M gates. We obtained the synthesized netlist from Rensselaer Polytechnic Institute [8]. A million random vectors are fed into the circuit for the full simulation, while 10,000 random vectors are used for pre-simulation. We assume a unit gate delay and zero transmission delay on the wires.
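The search in figure 3 can also be written as a runnable sketch. Here `presimulate` is a stand-in stub for a short pre-simulation run, and all names are ours: for each processor count k, from the maximum down, b is increased until the measured speedup first drops, and the best (k, b) combination seen is returned.

```python
def heuristic_presimulation(presimulate, n, b_values=(7.5, 10.0, 12.5, 15.0)):
    """Heuristic search for (k, b): for each k from n down to 2, increase b
    until the speedup first decreases, keeping the best combination seen.
    presimulate(k, b) is assumed to return the measured speedup."""
    best = (1.0, n, b_values[0])          # (speedup, k, b); 1.0 = sequential
    for k in range(n, 1, -1):
        prev = 0.0
        for b in b_values:
            speedup = presimulate(k, b)
            if speedup < prev:            # speedup dropped: stop increasing b
                break
            prev = speedup
            if speedup > best[0]:
                best = (speedup, k, b)
    return best[1], best[2]

# Stub shaped like the pre-simulation results in table 3, where the best
# measured combinations were (2, 12.5), (3, 10) and (4, 7.5).
table = {(2, 12.5): 1.65, (3, 10.0): 1.81, (4, 7.5): 1.96}
stub = lambda k, b: table.get((k, b), 0.9)
print(heuristic_presimulation(stub, 4))   # (4, 7.5)
```

The early break on the first speedup drop is what makes the heuristic cheap, and also why it can stop in a local minimum, as noted above.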
Each data point collected in the experiments is an average of five simulation runs. The simulation time for 1 machine is the execution time, excluding the time for partitioning. We compare the cut-size and the speedup produced by our algorithm to the same quantities as produced by hMetis [12], a widely used multi-level partitioning algorithm.

4.0.1 DVS

Our experiments make use of DVS [15], a framework for distributed Verilog simulation. Figure 4 portrays the architecture of DVS; its 3 layers are shown on the right side of the figure. The bottom layer is the communication layer, which provides a common message passing interface to the upper layers. Inside this layer, the software communication platform can be PVM or MPI; users can choose one of them without affecting the code of the upper layers. The middle layer is a parallel discrete event simulation kernel, OOCTW, which is an object-oriented version of Clustered Time Warp (CTW) [3]. It provides services such as rollback, state saving and restoration, GVT computation and fossil collection to the top layer. The top layer is the distributed simulation engine, which includes an event handler and an interpreter which executes instructions in the code space of a virtual thread.

Figure 4. Architecture of DVS (vvp parser and partitioner feeding the distributed simulation engine, over OOCTW and MPI/PVM)

4.1 Cut-size

We use different values of k and b to generate different cut-sizes. The hyperedge cut-size is defined as the number of hyperedges that span multiple partitions. Table 1 shows the hyperedge cut-size produced by our algorithm, while table 2 lists the cut-sizes produced by the hMetis partitioning algorithm.

    k    b      Hyperedge cut
    2    2.5    2428
    2    5      1827
    2    7.5     905
    2    10      633
    2    12.5    598
    2    15      513
    3    2.5    2930
    3    5      2227
    3    7.5    1230
    3    10      894
    3    12.5    863
    3    15      790
    4    2.5    3230
    4    5      2326
    4    7.5    1433
    4    10      979
    4    12.5    935
    4    15      887

Table 1. Cut-size with the design-driven partitioning algorithm
The parameter b is the load balancing factor defined in formula 1, while k is the number of partitions. Tables 1 and 2 reveal that our algorithm resulted in a significantly smaller cut-size than the one produced by hMetis.

    k    b      Hyperedge cut
    2    2.5    2675
    2    5      2673
    2    7.5    2673
    2    10     2669
    2    12.5   2668
    2    15     2665
    3    2.5    2932
    3    5      2932
    3    7.5    2931
    3    10     2935
    3    12.5   2931
    3    15     2927
    4    2.5    3195
    4    5      3195
    4    7.5    3191
    4    10     3191
    4    12.5   3191
    4    15     3191

Table 2. Cut-size with the hMetis partitioning algorithm

4.2 Pre-simulation

We used 10,000 random vectors in our pre-simulation in order to pick the best partition for different combinations of the partition number k and the load balance factor b. The sequential simulation time of the circuit with 10,000 random vectors was 38.93 seconds. Table 3 shows the simulation time and speedup for these combinations. We list the best partitions, as determined by the largest speedup, in table 4.

4.2.1 Pre-simulation algorithm: choosing k and b

In spite of the short run time of the pre-simulations, it is not practical to try all combinations of k and b in a realistic environment. Consequently, we developed the heuristic algorithm described in section 3.4 for their choice, in order to minimize the amount of time devoted to the pre-simulation.

4.3 Simulation time

We only treat the visible nodes in the circuit hypergraph as LPs. A visible node is one which is not inside a Verilog instance. For a Verilog module, the states and input events are saved for each input port, while the output events are saved for each output port. An invisible node without memory inside a Verilog module will not save its state and events. However, invisible nodes with memory (e.g.
a register) will still save their states. If a rollback happens in a Verilog module, every child inside of the Verilog module rolls back along with its parent. Previous results [16] indicate that we experience excessive memory consumption if all nodes are treated as LPs.

    k    b      Cut-size    Simulation time (s)    Speedup
    2    2.5    2428        61.79                  0.62
    2    5      1827        41.86                  0.93
    2    7.5     905        30.65                  1.27
    2    10      633        25.78                  1.51
    2    12.5    598        23.59                  1.65
    2    15      513        29.72                  1.31
    3    2.5    2930        56.42                  0.69
    3    5      2227        39.72                  0.98
    3    7.5    1230        28.87                  1.35
    3    10      894        21.50                  1.81
    3    12.5    863        22.37                  1.74
    3    15      790        25.44                  1.53
    4    2.5    3230        88.47                  0.44
    4    5      2326        42.78                  0.91
    4    7.5    1433        19.86                  1.96
    4    10      979        24.80                  1.57
    4    12.5    935        21.04                  1.85
    4    15      887        24.18                  1.61

Table 3. Pre-simulation time with the design-driven partitioning algorithm

    k    b      Cut-size    Simulation time (s)    Speedup
    2    12.5    598        23.59                  1.65
    3    10      894        21.50                  1.81
    4    7.5    1463        19.86                  1.96

Table 4. Best partitions produced by the design-driven partitioning algorithm

    k    b      Cut-size    Simulation time (s)    Speedup
    2    12.5    598        2201.98                1.65
    3    10      894        2033.35                1.79
    4    7.5    1463        1905.60                1.91

Table 5. Simulation time with the design-driven partitioning algorithm

Table 5 and figure 5 show the simulation times and speedups with different combinations of the load balancing factor and cut-size. The sequential simulation time of the circuit is 3639.70 seconds.

Figure 5. Simulation time vs. number of machines

From figure 5 and from the data in table 5, we observe that as the number of machines increases from 2 to 4, the simulation time does not decrease significantly. This is because as the number of processors increases, the circuit is divided more finely and the design hierarchy is destroyed. As a result, the cut-size increases and the communication costs eventually overwhelm the algorithm. The importance of making a good choice of k and b is illustrated by the first two rows of table 5.
Here we see that the distributed simulation is slower than the sequential simulation because of a poor choice of k and b. In the same vein, we note from table 5 that the minimum cut-size does not always result in the best performance. For example, we got the best performance on two machines with a combination of a cut-size of 598 and a static load balancing factor of 12.5.

Finally, we point out that the speedups of the full simulation with 1 million random vectors are slightly less than the speedups achieved in the pre-simulation with 10,000 random vectors. This confirms the observations made in [6].

4.4 Messages and Rollbacks

Figure 6. Message number during the pre-simulation (per number of machines, for b = 2.5 to 15)

Figure 7. Rollback number during the pre-simulation (per number of machines, for b = 2.5 to 15)

Figures 6 and 7 confirm the relationship between load balance and communication. Relaxing the load balancing constraint (i.e. increasing b) results in fewer messages and rollbacks. With an increase in the number of machines, communication increases and both of these quantities increase. These results underscore the importance of pre-simulation in picking the final partition.

5 Conclusion

The algorithm described in this paper is an extension of the two-way partitioning algorithm described in [16] to a multi-way partitioning algorithm, and is intended for gate level Verilog designs. Like its predecessor, it makes use of the information present in the Verilog design hierarchy to produce the partition instead of using a flattened netlist. It relies on pre-simulation to evaluate the trade-off between load balance and the communication between the nodes. The algorithm produced a 4.5-fold reduction in cut-size compared to hMetis [12] on the million gate circuit used in our experiments.
The reduction in cut-size and the preservation of design locality led to a speedup of 1.91 on a 4 node cluster.

An interesting extension of the algorithm would be to make it responsive to changes in processor loads. Currently our load metric is the number of gates, which is not entirely adequate for this task. Due to the difficulty of obtaining a large gate level Verilog design, we have confined our experiments to the circuit described in this paper. We are now in the process of synthesizing a gate level Verilog design from an open source RTL design for a Sparc computer so that we may experiment on a large, realistic design.

References

[1] A. E. Caldwell, A. B. Kahng, and I. L. Markov. Design and implementation of the Fiduccia-Mattheyses heuristic for VLSI netlist partitioning. In Proc. Workshop on Algorithm Engineering and Experimentation (ALENEX), Baltimore, pages 177-193, Jan. 1999.

[2] C. J. Alpert and A. B. Kahng. Recent directions in netlist partitioning: A survey. Integration, the VLSI Journal, 19(1-2):1-81, Aug. 1995.

[3] Herve Avril and Carl Tropper. Scalable clustered time warp and logic simulation. VLSI Design, 1998.

[4] M. Bailey, J. Briner, and R. Chamberlain. Parallel logic simulation of VLSI systems. ACM Computing Surveys, 26(3):255-295, Sept. 1994.

[5] J. Cong and S. Lim. Multiway partitioning with pairwise movement. In Proceedings of the IEEE/ACM ICCAD, pages 512-516, 1998.

[6] R. D. Chamberlain and C. Henderson. Evaluating the use of pre-simulation in VLSI circuit partitioning. In Proc. 1994 Workshop on Parallel and Distributed Simulation, pages 139-146, 1994.

[7] A. Dasdan and C. Aykanat. Two novel multiway partitioning algorithms using relaxed locking. IEEE Trans. on Computer-Aided Design, 16(2):169-178, 1997.

[8] L. Zhu et al. Parallel logic simulation of million-gate VLSI circuits. In 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2005), 2005.

[9] C. Fiduccia and R. Mattheyses. A linear-time heuristic for improving network partitions. In ACM/IEEE Design Automation Conference, pages 175-181, 1982.

[10] MPICH. http://www-unix.mcs.anl.gov/mpi/mpich.

[11] George Karypis, Rajat Aggarwal, Vipin Kumar, and Shashi Shekhar. Multilevel hypergraph partitioning: Applications in VLSI domain. In ACM/IEEE Design Automation Conference, pages 526-529, 1997.

[12] George Karypis, Rajat Aggarwal, Vipin Kumar, and Shashi Shekhar. Multilevel hypergraph partitioning: Applications in VLSI domain. IEEE Transactions on VLSI Systems, 7(1):69-79, 1999.

[13] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. Technical Report TR 95-035, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1995.

[14] H. K. Kim and J. Jean. Concurrency preserving partitioning (CPP) for parallel logic simulation. In 10th Workshop on Parallel and Distributed Simulation (PADS '96), pages 98-105, May 1996.

[15] Lijun Li, Hai Huang, and Carl Tropper. DVS: an object-oriented framework for distributed Verilog simulation. In Parallel and Distributed Simulation (PADS 2003), pages 173-180, June 2003.

[16] Lijun Li and Carl Tropper. A design-driven iterative partitioning algorithm for distributed Verilog simulation. In 21st International Workshop on Principles of Advanced and Distributed Simulation (PADS 2007), pages 173-180, June 2007.

[17] Naraig Manjikian and Wayne M. Loucks. High performance parallel logic simulations on a network of workstations. In Proceedings of the Seventh Workshop on Parallel and Distributed Simulation, pages 76-84, May 1993.

[18] R. Chamberlain and C. Henderson. Evaluating the use of pre-simulation in VLSI circuit partitioning. In PADS '94, pages 139-146, 1994.

[19] G. Saucier, D. Brasen, and J. P. Hiol. Partitioning with cone structures. In IEEE/ACM International Conference on CAD, pages 236-239, 1993.

[20] Shantanu Dutt and Wenyong Deng. Cluster-aware iterative improvement techniques for partitioning large VLSI circuits. ACM Transactions on Design Automation of Electronic Systems (TODAES), 7(1):91-121, Jan. 2002.

[21] S. Smith, M. Mercer, and B. Underwood. An analysis of several approaches to circuit partitioning for parallel logic simulation. In Proc. Int. Conference on Computer Design, IEEE, pages 664-667, 1987.

[22] Swaminathan Subramanian, Dhananjai M. Rao, and Philip A. Wilsey. Applying multilevel partitioning to parallel logic simulation. In Parallel and Distributed Computing Practices, volume 4, pages 37-59, March 2001.