
steps in the schedule.

21. * Assume that you have a library that contains multiple implementations of functional units with different sizes and speeds. Devise an algorithm that would perform scheduling in combination with unit selection. Hint: An intuitive strategy for scheduling is to use the fast components only for the operations that are critical to the performance of the overall design while implementing the non-critical operations with the slower components.

Chapter 8

22. Compute the cost of implementing the control logic for the schedules given in Figure 7.12(b), Figure 7.13(a) and Figure 7.13(b), assuming a datapath with 3, 4 and 6 functional units and a shared register file with 6, 8 and 12 read ports and 3, 4 and 6 write ports. Assume that the register file has 256 words.


As described in the previous chapter, scheduling assigns operations to control steps and thus converts a behavioral description into a set of register transfers that can be described by a state table. A target architecture for such a description is the FSMD given in Chapter 2. We derive the control unit for such an FSMD from the control-step sequence and the conditions used to determine the next control step in the sequence. The datapath is derived from the register transfers assigned to each control step; this task is called datapath synthesis or datapath allocation. A datapath in the FSMD model is a netlist composed of three types of register transfer (RT) components or units: functional, storage and interconnection. Functional units, such as adders, shifters, ALUs and multipliers, execute the operations specified in the behavioral description. Storage units, such as registers, register files, RAMs and ROMs, hold the values of variables generated and consumed during the execution of the behavior. Interconnection units, such as buses and multiplexers, transport data between the functional and storage units. Datapath allocation consists of two essential tasks: unit selection and unit binding. Unit selection determines the number and types of RT components to be used in the design. Unit binding involves the mapping of the variables and operations in the scheduled CDFG into the functional, storage and interconnection units, while ensuring that the

variables a and e reside, must be connected to the input ports of ADD1; otherwise, operation o3 will not be able to execute in ADD1. Similarly, operations o2 and o4 are mapped to ADD2. Note that there are several different ways of performing the binding. For example, we can map o2 and o3 to ADD1 and o1 and o4 to ADD2.


Besides implementing the correct behavior, the allocated datapath must meet the overall design constraints in terms of the metrics defined in Chapter 3 (e.g., area, delay and power dissipation). To simplify the allocation problem, we use two quality measures for datapath allocation: the total size (i.e., the silicon area) of the design and the worst-case register-to-register delay (i.e., the clock cycle) of the design. We can solve the allocation problem in three ways: greedy approaches, which progressively construct a design while traversing the CDFG; decomposition approaches, which decompose the allocation problem into its constituent parts and solve each of them separately; and iterative methods, which try to combine and interleave the solution of the allocation subproblems. We begin this chapter with a brief discussion of typical datapath architectural features and their effects on the datapath-allocation problem. Using a simple design model, we then outline the three techniques for datapath allocation: the greedy constructive approach, the decomposition approach and the iterative refinement approach. Finally, we conclude this chapter with a brief discussion of future trends.

Figure 8.1: Mapping of behavioral objects into RT components.

design behavior operates correctly on the selected set of components. For every operation in the CDFG, we need a functional unit that is capable of executing the operation. For every variable that is used across several control steps in the scheduled CDFG, we need a storage unit to hold the data values during the variable's lifetime. Finally, for every data transfer in the CDFG, we need a set of interconnection units to effect the transfer. Besides the design constraints imposed on the original behavior and represented in the CDFG, additional constraints on the binding process are imposed by the type of hardware units selected. For example, a functional unit can execute only one operation in any given control step. Similarly, the number of multiple accesses to a storage unit during a control step is limited by the number of parallel ports on the unit. We illustrate the mapping of variables and operations in the DFG of Figure 8.1 into RT components. Let us assume that we select two adders, ADD1 and ADD2, and four registers, r1, r2, r3 and r4. Operations o1 and o2 cannot be mapped into the same adder because they must be performed in the same control step s1. On the other hand, operation o1 can share an adder with operation o3 because they are carried out during different control steps. Thus, operations o1 and o3 are both mapped into ADD1. Variables a and e must be stored separately because their values are needed concurrently in control step s2. Registers r1 and r2, where




In Chapter 2, we discussed basic target architectures and showed how pipelined datapaths are used for performance improvements with negligible increase in cost. In Chapter 3, we presented formulas for calculating the clock cycle for such architectures. In this section we will review some basic features of real datapaths and relate them to the formulation of the datapath-allocation problem. A datapath architecture defines the characteristics of the datapath units and the interconnection topology. A simple target architecture may greatly reduce the complexity of the synthesis problems since the number of alternative designs is greatly reduced. On the other hand,

a less constrained architecture, although more difficult to synthesize, may result in higher quality designs. While an oversimplified datapath architecture leads to elegant synthesis algorithms, it also usually results in unacceptable designs. The interconnection topology that supports data transfers between the storage and functional units is one of the factors that has a significant influence on the datapath performance. The complexity of the interconnection topology is defined by the maximum number of interconnection units between any two ports of functional or storage units. Each interconnection unit can be implemented with a multiplexer or a bus. For example, Figure 8.2 shows two datapaths, using multiplexer and bus interconnection units respectively, which implement the following five register transfers:

    s1:  r3 <= ALU1(r1, r2);   r1 <= ALU2(r3, r4);
    s2:  r1 <= ALU1(r5, r6);   r6 <= ALU2(r2, …);
    s3:  r3 <= ALU1(r1, r6).


We call the interconnection topology "point-to-point" if there is only one interconnection unit between any two ports of the functional and/or storage units. The point-to-point topology is most popular in high-level synthesis since it simplifies the allocation algorithms. In this topology, we create a connection between any two functional or storage units as needed. If more than one connection is assigned to the input of a unit, a multiplexer or a bus is used. In order to minimize the number of interconnections, we can combine registers into register files with multiple ports. Each port may support read and/or write accesses to the data. Some register files allow simultaneous read and write accesses through different ports. Although register files reduce the cost of interconnection units, each port requires dedicated decoder circuitry inside the register file, which increases the storage cost and the propagation delay. To simplify the binding problem, in this section we assume that all register transfers go through functional units and that direct interconnections of two functional units are not allowed. Therefore, we only need interconnection units to connect the output ports of storage units

Figure 8.2: Datapath interconnections: (a) a multiplexer-oriented datapath, (b) a bus-oriented datapath.


to the input ports of functional units (i.e., the input interconnection network) and the output ports of functional units to the input ports of storage units (i.e., the output interconnection network). The complexity of the input and output interconnection networks need not be the same. One can be simplified at the expense of the other. For example, selectors may not be allowed in front of the input ports of storage units. This results in a more complicated input interconnection network, and, hence, an imbalance between the read and write times of the storage units. In addition to affecting the allocation algorithms, such an architecture increases testability [GrDe90]. Furthermore, the number of buses that each register file drives can be constrained. If a unique bus is allocated for each register-file output, tri-state bus drivers are not needed between the registers and the buses [GeEI90]. This restriction on register-file outputs produces more multiplexers at the input ports of functional units. Moreover, some variables may need to be duplicated across different register files in order to simplify the selector circuits between the buses and the functional units. Another interconnection scheme commonly used in processors has the buses partitioned into segments so that each bus segment is used by one functional unit at a time [Ewer90]. The functional and storage units are arranged in a single row with bus segments on either side, so the arrangement looks like a 2-bus architecture. A data transfer between the segments is achieved through switches between bus segments. Thus, we accomplish interconnection allocation by controlling the switches between the bus segments to allow or disallow data transfers. Let us analyze the delays involved in register-to-register transfers for the five-transfer example in Figure 8.2. The relative timing of read, execute and write micro-operations in the first two clock cycles of the example is shown in Figure 8.3.
Let t_r be the time delay involved in reading the data out of the registers and then propagating through the input interconnection network; t_e, the propagation delay through a functional unit; and t_w, the delay for data to propagate from the functional units through the output interconnection network and be written to the registers. In the target architectures described so far, all the components along the path from the registers' output ports back to the registers' input ports (i.e., the input interconnection network, the ALUs and the output interconnection network) are combinational. Thus, the clock
Figure 8.3: Sequential execution of three micro-operations in the same clock period.

cycle will be equal to or greater than t_r + t_e + t_w.

As described in Chapter 2, latches may be inserted at the input and/or output ports of functional units to improve the datapath performance. When latches are inserted only at the outputs of the functional units (Figure 8.4), the read accesses and functional-unit execution for operations scheduled into the current control step can be performed at the same time as the write accesses of operations scheduled into the previous control step (Figure 8.5). The cycle period is reduced to max(t_r + t_e, t_w). But the register transfers are not well balanced: reading of the registers and execution of an ALU operation are performed in the first cycle, while only writing of the result back into the registers is performed during the second cycle. Similarly, if only the inputs of the functional units are latched, the read accesses for operations scheduled into the next control step can be performed at the same time as the functional-unit execution and the write accesses for operations scheduled into the current control step. The cycle period in that case will be max(t_r, t_e + t_w). In either case, the register files and latches are controlled by a single-phase clock. Operation execution and the reading/writing of data can take place

Figure 8.4: Insertion of latches at the output ports of the functional units.

Figure 8.6: Insertion of latches at both the input and output ports of the functional units.

concurrently when both the inputs and outputs of a functional unit are latched (Figure 8.6). The three combinational components, namely, the input-interconnection units, the functional units and the output-interconnection units, can all be active concurrently. Figure 8.7 shows how this pipelining scheme works. The clock cycle consists of two minor cycles. The execution of an operation is spread across three consecutive clock cycles. The input operands for an operation are transferred from the register files to the input latches of the functional units during the second minor cycle of the first cycle. During the second cycle, the functional unit executes the operation and writes the result to the output latch by the end of the cycle. The result is transferred to the final destination, the register file, during the first minor cycle of the third cycle. A two-phase non-overlapping clocking scheme is needed. Both the input and output latches are controlled by one phase since the end of the read access and the end of operation execution occur simultaneously. The other phase is used to control the write accesses to the register files. By overlapping the execution of operations in successive control steps, we can greatly increase the hardware utilization. The cycle period is reduced to max(t_e, t_r + t_w). Moreover, the input and output networks can share some interconnection units. For example, OutBus1 is merged with InBus3 and OutBus2 with InBus4 in Figure 8.6. Thus, inserting input and output latches makes more interconnection units available for merging, which may simplify the datapath design. By breaking the register-to-register transfers into micro-operations executed in different clock cycles, we achieve a better utilization of hardware resources. However, this scheme requires binding algorithms to search through a larger number of design alternatives.
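The four clock-period expressions discussed above can be compared side by side with a small calculation. The following is only a sketch: the function name `cycle_periods` and the delay values are hypothetical, and the three arguments stand for the read, execute and write delays defined earlier.

```python
def cycle_periods(t_r, t_e, t_w):
    """Return the minimum clock period for each latching scheme.

    t_r: read delay (registers plus input interconnection network)
    t_e: propagation delay through a functional unit
    t_w: write delay (output interconnection network plus register write)
    """
    return {
        "no latches": t_r + t_e + t_w,
        "output latches only": max(t_r + t_e, t_w),
        "input latches only": max(t_r, t_e + t_w),
        "input and output latches": max(t_e, t_r + t_w),
    }


# Hypothetical delays in arbitrary time units.
periods = cycle_periods(t_r=10, t_e=30, t_w=8)
for scheme, period in periods.items():
    print(f"{scheme}: {period}")
```

With these made-up delays the functional unit dominates, so latching both sides gives the shortest period; with a slow register file the ranking could differ.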
Operator chaining was introduced in Section 7.3.1 as the execution of two or more operations in series during the same control step. To support operator chaining, links are needed from the output ports of some functional units directly to the input ports of other functional units. In an architecture with a shared bus (e.g., Figure 8.6), this linking can be accomplished easily by using the path from a functional unit's output port through one of the buses to some other functional unit's input port. Since such a path must be combinational, bypass circuits have to be added around all the latches along the path of chaining.

Datapath synthesis consists of four different yet interdependent tasks: module selection, functional-unit allocation, storage allocation and interconnection allocation. In this section, we define each task and discuss the nature of their interdependence.

A simple design model may assume that we have only one particular type of functional unit for each behavioral operation. However, a real RT component library contains multiple types of functional units, each with different characteristics (e.g., functionality, size, delay and power dissipation) and each implementing one or several different operations in the register-transfer description. For example, an addition can be carried out by either a small but slow ripple adder or by a large but fast carry look-ahead adder. Furthermore, we can use several different component types, such as an adder, an adder/subtracter or an entire ALU, to perform an addition operation. Thus, unit selection selects the number and types of different functional and storage units from the component library. A basic requirement for unit selection is that the number of units performing a certain type of operation must be equal to or greater than the maximum number of operations of that type to be performed in any control step. Unit selection is frequently combined with binding into one task called allocation.
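The basic requirement stated above gives a lower bound on the number of units per operation type that is easy to compute from a schedule. The sketch below assumes the simple design model with one unit type per operation type; the function name `min_units_per_type` and the example schedule are hypothetical, and multi-function units such as ALUs are not modeled.

```python
from collections import Counter


def min_units_per_type(schedule):
    """schedule maps a control step to the list of operation types in that step.

    Returns, for each operation type, the largest number of operations of
    that type found in any single control step.
    """
    required = {}
    for ops in schedule.values():
        for op_type, count in Counter(ops).items():
            required[op_type] = max(required.get(op_type, 0), count)
    return required


# Hypothetical three-step schedule.
schedule = {1: ["add", "add", "mul"], 2: ["add", "mul", "mul"], 3: ["sub"]}
required = min_units_per_type(schedule)
print(required)
```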

After all the functional units have been selected, operations in the behavioral description must be mapped into the set of selected functional units. Whenever we have operations that can be mapped into more than one functional unit, we need a functional-unit binding algorithm to determine the exact mapping of the operations into the functional units. For example, operations o1 and o3 in Figure 8.1 have been mapped into adder ADD1, while operations o2 and o4 have been mapped into adder ADD2.

Storage binding maps data carriers (e.g., constants, variables and data structures like arrays) in the behavioral description to storage elements (e.g., ROMs, registers and memory units) in the datapath. Constants, such as coefficients in a DSP algorithm, are usually stored in a read-only memory (ROM). Variables are stored in registers or memories. Variables whose lifetime intervals do not overlap with each other may share the same register or memory location. The lifetime of a variable is the time interval between its first value assignment (the first variable appearance on the left-hand side of an assignment statement) and its last use (the last variable appearance on the right-hand side of an assignment statement). After variables have been assigned to registers, the registers can be merged into a register file with a single access port if the registers in the file are not accessed simultaneously. Similarly, registers can be merged into a multiport register file as long as the number of registers accessed in each control step does not exceed the number of ports.
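The sharing rule for variables with non-overlapping lifetimes can be illustrated with a short sketch. It rests on assumptions that are not in the text: a lifetime is modeled as a pair (birth, death), meaning the value is written at the end of step `birth` and last read in step `death`, so a variable born in the step where another dies does not conflict with it. The packing strategy is a plain first-fit in order of birth, chosen for brevity and not presented as the chapter's method; all names and data are hypothetical.

```python
def lifetimes_overlap(a, b):
    """a and b are (birth, death) pairs; a value is live in steps birth+1..death."""
    return a[0] < b[1] and b[0] < a[1]


def assign_registers(lifetimes):
    """First-fit packing of variables into registers, in order of birth step."""
    registers = []  # each register is a list of variable names
    for var in sorted(lifetimes, key=lambda v: lifetimes[v]):
        for reg in registers:
            # A variable may join a register only if it conflicts with no occupant.
            if not any(lifetimes_overlap(lifetimes[var], lifetimes[u]) for u in reg):
                reg.append(var)
                break
        else:
            registers.append([var])  # no existing register fits: allocate a new one
    return registers


# Hypothetical lifetimes of five variables.
lifetimes = {"a": (0, 1), "b": (0, 1), "c": (1, 2), "d": (1, 3), "e": (2, 3)}
registers = assign_registers(lifetimes)
print(registers)
```

Here two registers suffice because at most two values are live in any control step.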

Every data transfer (i.e., a read or write) needs an interconnection path from its source to its sink. Two data transfers can share all or part of the interconnection path if they do not take place simultaneously. For example, in Figure 8.1, the reading of variable b in control step s1 and variable e in control step s2 can be achieved by using the same interconnection unit. However, writing to variables e and f, which occurs simultaneously in control step s1, must be accomplished using disjoint paths. The objective of interconnection binding is to maximize the sharing of interconnection units and thus minimize the interconnection cost, while still supporting the conflict-free data transfers required by the register-transfer description.

Figure 8.8: Interdependence of functional-unit and storage binding: (a) a scheduled DFG, (b) a functional-unit binding requiring six multiplexers, (c) improved design with two fewer multiplexers, obtained by register reallocation, (d) optimal design with no multiplexers, obtained by modifying functional-unit binding.

All the datapath synthesis tasks (i.e., scheduling, unit selection, functional-unit binding, storage binding and interconnection binding) depend on each other. In particular, functional-unit, storage and interconnection binding are tightly related to each other. For example, Figure 8.8 shows how both functional-unit and storage binding affect interconnection allocation. Suppose the eight variables (a through h) in the DFG of Figure 8.8(a) have been partitioned into four registers as follows: r1 <- {a}, r2 <- {b, e, g}, r3 <- {c, f, h}, r4 <- {d}. Given two adders, ADD1 and ADD2, there are two ways of grouping the four addition operations, o1, o2, o3 and o4, in Figure 8.8(a) so that each group is assigned to one adder:

(1) ADD1 <- {o1, o4}, ADD2 <- {o2, o3};
(2) ADD1 <- {o1, o3}, ADD2 <- {o2, o4}.


For the given register binding, we need six 2-to-1 multiplexers for unit interconnection in case (1) (Figure 8.8(b)). However, we can eliminate two 2-to-1 multiplexers (Figure 8.8(c)) by modifying the register binding to be: r1 <- {a, g}, r2 <- {b, e}, r3 <- {c, f}, r4 <- {d, h}. If we then modify the functional-unit binding to case (2) above, no multiplexers are needed (Figure 8.8(d)). Thus, this design is optimal if the interconnection cost is measured by the number of multiplexers needed. Clearly, both functional-unit and storage binding potentially affect the optimization achievable by interconnection allocation. The previous example also raises the issue of ordering among the allocation tasks. The requirements on interconnection become clear after both functional-unit and storage allocation have been performed. Furthermore, functional-unit allocation can make correct decisions if storage allocation is done beforehand, and vice versa. To break this deadlock, we have to perform one task ahead of the other. Unfortunately, in such an ordering, the first task chosen cannot use the information from the second task, which would have been available had the second task been performed first.

A constructive algorithm starts with an empty datapath and builds the datapath gradually by adding functional, storage and interconnection units as necessary. For each operation, it tries to find a functional unit on the partially designed datapath that is capable of executing the operation and is idle during the control step in which the operation must be executed. In case there are two or more functional units that meet these conditions, we choose the one which results in a minimal increase in the interconnection cost. On the other hand, if none of the functional units on the partially designed datapath meet the conditions, we add a new functional unit from the component library that is capable of carrying out the operation. Similarly, we can assign a variable to an available register only if its lifetime interval does not overlap with those of variables already assigned to that register. A new register is allocated only when no allocated register meets the above condition. Again, when multiple alternatives exist for assignment of a variable to a register, we select the one that minimally increases the datapath cost. Figures 8.9(b)-(g) show several modified datapaths obtained by adding interconnection and/or functional units to the partial design of Figure 8.9(a). We discuss each case below:

1. In Figure 8.9(b) we add two new connections, one of them from Bus1, to the left input of ALU1.

2. In Figure 8.9(c) we add a connection from r3 to the right input of ALU2 through Bus1. Since the connection from Bus1 to the right input of ALU2 already exists, we add only a tri-state buffer.

3. Figure 8.9(d) is similar to Figure 8.9(c), except that we add the multiplexer Mux2 instead of a tri-state buffer.

4. In Figure 8.9(e) we add a new functional unit, ALU3, and a new connection from each of r3 and Bus1 to the right input of ALU3.

5. Figure 8.9(f) is similar to Figure 8.9(e), except that we add the multiplexer Mux2 instead of a tri-state buffer.

We can construct a datapath using a greedy approach in which RT components are assigned to operations in a step-by-step fashion [KuPa90].

Interconnection allocation follows immediately after both the source and sink of a data transfer have been bound. For example, the partial datapath in Figure 8.9(a) does not have a link between register r3 and ALU2. Suppose a variable that is one of the inputs to an operation that has been assigned to ALU2 is just assigned to register r3. Then, an interconnection link has to be established as shown by the bold wires in Figure 8.9(c). Each interconnection unit that is required by a data transfer in the behavioral description contributes to the datapath cost. For example, in Figure 8.9(b), two more connections are made to the left input port of ALU1, where a two-input multiplexer already exists. Therefore, the cost of this modification is the difference between the cost of a two-input multiplexer and that of a four-input multiplexer. However, at least two additional data transfers are now supported with this modification.

Algorithm 8.1 describes the greedy constructive allocation method. Let UBE be the set of unallocated behavioral entities and DPcurrent be the partially designed datapath. The behavioral entities being considered could be variables that have to be mapped into registers, operations that have to be mapped into functional units, or data transfers that have to be mapped into interconnection units. DPcurrent is initially empty. The procedure ADD(DP, ube) structurally modifies the datapath DP by adding to it the components necessary to support the behavioral entity ube. The function COST(DP) evaluates the area/performance cost of a partially designed datapath DP. DPwork is a temporary datapath which is created in order to evaluate the cost Cwork of performing each modification to DPcurrent. Starting with the set UBE, the inner for loop determines which unallocated behavioral entity, BestEntity, requires the minimal increase in the cost when added to the datapath. This is accomplished by adding each of the unallocated behavioral entities in UBE to DPcurrent individually and then evaluating the resulting cost. The procedure ADD then modifies DPcurrent by incorporating BestEntity into the datapath. BestEntity is deleted from the set of unallocated behavioral entities. The algorithm iterates in the outer while loop until all behavioral entities have been allocated (i.e., UBE = ∅).
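The loop structure just described can be tried out on a toy model. The sketch below is illustrative only and makes several assumptions that are not in the text: the behavioral entities are restricted to operations given as (name, type, control step), a datapath is a list of functional units, `add` reuses the first idle unit of the right type or else creates a new one, and `cost` is simply the sum of hypothetical unit areas in `UNIT_AREA`. Interconnection and storage costs are ignored.

```python
import math

UNIT_AREA = {"add": 3, "mul": 10}  # hypothetical relative areas


def add(datapath, op):
    """Return a copy of datapath extended so that it can execute op."""
    name, kind, step = op
    new_dp = [dict(u, ops=list(u["ops"])) for u in datapath]
    for unit in new_dp:
        # Reuse a unit of the right type that is idle in this control step.
        if unit["kind"] == kind and all(s != step for _, s in unit["ops"]):
            unit["ops"].append((name, step))
            return new_dp
    new_dp.append({"kind": kind, "ops": [(name, step)]})
    return new_dp


def cost(datapath):
    """Toy cost: total area of the functional units allocated so far."""
    return sum(UNIT_AREA[u["kind"]] for u in datapath)


def greedy_allocate(entities):
    unallocated = list(entities)  # plays the role of UBE
    dp_current = []               # DPcurrent, initially empty
    while unallocated:
        lowest_cost, best_entity = math.inf, None
        for ube in unallocated:
            c_work = cost(add(dp_current, ube))  # cost of the trial datapath DPwork
            if c_work < lowest_cost:
                lowest_cost, best_entity = c_work, ube
        dp_current = add(dp_current, best_entity)
        unallocated.remove(best_entity)
    return dp_current


ops = [("o1", "add", 1), ("o2", "add", 1), ("o3", "add", 2), ("o4", "mul", 2)]
datapath = greedy_allocate(ops)
for unit in datapath:
    print(unit["kind"], unit["ops"])
```

Because the costs are re-evaluated after every binding, o3 is bound before o2: once o1 occupies the first adder, adding o3 to it is free, whereas o2 needs a second adder.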



Figure 8.9: Datapath construction: (a) an initial partial design, (b) addition of two more inputs to a multiplexer, (c) addition of a tri-state buffer to a bus, (d) addition of a multiplexer to the input of a functional unit, (e) addition of a functional unit and a tri-state buffer to a bus, (f) addition of a functional unit and a multiplexer, (g) conversion of a multiplexer to a shared bus.

In order to use the greedy constructive approach, we have to address two basic issues: the cost-function calculation and the order in which the unallocated behavioral entities are mapped into the datapath. The costs can be computed as explained in Chapter 3. For example, the cost

Algorithm 8.1: Greedy constructive allocation.

    DPcurrent = ∅;
    while UBE ≠ ∅ do
        LowestCost = ∞;
        for all ube ∈ UBE do
            DPwork = ADD(DPcurrent, ube);
            Cwork = COST(DPwork);
            if Cwork < LowestCost then
                LowestCost = Cwork;
                BestEntity = ube;
            endif
        endfor
        DPcurrent = ADD(DPcurrent, BestEntity);
        UBE = UBE - BestEntity;
    endwhile

The intuitive approach taken by the constructive method falls into the category of greedy algorithms. Although greedy algorithms are simple, the solutions they find can be far from optimal. In order to improve the quality of the results, some researchers have proposed a decomposition approach, where the allocation process is divided into a sequence of independent tasks; each task is transformed into a well-defined problem in graph theory and then solved with a proven technique. While a greedy constructive approach like the one described in Algorithm 8.1 might interleave the storage, functional-unit and interconnection allocation steps, decomposition methods will complete one task before performing another. For example, all variable-to-register assignments might be completed before any operation-to-functional-unit assignments are performed, or vice versa. Because of interdependencies among these tasks, no optimal solution is guaranteed even if all the tasks are solved optimally. For example, a design using three adders may need one fewer multiplexer than the one using only two adders. Therefore, an allocation strategy that minimizes the usage of adders is justified only when an adder costs more than a multiplexer. In this section we describe allocation techniques based on three graph-theoretical methods: clique partitioning, the left-edge algorithm and the weighted bipartite-matching algorithm. For the sake of simplicity, we illustrate these allocation techniques by applying them to behavioral descriptions consisting of straight-line code without any conditional branches.

of converting the datapath in Figure 8.9(a) to the one in Figure 8.9(g) depends on the difference between the cost of one 2-to-1 multiplexer and that of three tri-state buffers. Since the buses of Figure 8.9(a) and Figure 8.9(g) are of different length, the two datapaths may also have different wiring costs. The order in which unallocated entities are mapped into the datapath can be determined either statically or dynamically. In a static approach, the objects are ordered before the datapath construction begins. The ordering is not changed during the construction process. By contrast, in a dynamic approach no ordering is done beforehand. To select an operation or variable for binding to the datapath, we evaluate every unallocated behavioral entity in terms of the cost involved in modifying the partial datapath, and the entity that requires the least expensive modification is chosen. After each binding, we reevaluate the costs associated with the remaining unbound entities. Algorithm 8.1 uses the dynamic strategy. In hardware sharing, a previously expensive binding may become inexpensive after some other bindings are done. Therefore, a good strategy incorporates a look-ahead factor into the cost function. That is, the cost of a modification to the datapath should be lower if it decreases

The three tasks of storage, functional-unit and interconnection allocation can be solved independently by mapping each task to the well-known problem of graph clique partitioning [TsSi86]. We begin by defining the clique-partitioning problem. Let G = (V, E) denote a graph, where V is the set of vertices and E the set of edges. Each edge e_{i,j} ∈ E links two different vertices v_i and v_j ∈ V. A subgraph SG of G is defined as (SV, SE), where SV ⊆ V and SE = {e_{i,j} | e_{i,j} ∈ E, v_i, v_j ∈ SV}. A graph is complete if and only if for every pair of its vertices there exists an edge linking them. A clique of G is a complete subgraph of G. The problem of partitioning a graph into a minimal number of cliques such that each node belongs to exactly one clique is called clique partitioning. The clique-partitioning problem is a classic NP-complete problem for which heuristic procedures are usually used. Algorithm 8.2 describes a heuristic proposed by Tseng and Siewiorek [TsSi86] to solve the clique-partitioning problem. A super-graph G'(S, E') is derived from the original graph G(V, E). Each node s_i ∈ S is a super-node that can contain a set of one or more vertices v_i ∈ V. E' is identical to E except that the edges in E' now link super-nodes in S. A super-node s_i ∈ S is a common neighbor of the two super-nodes s_j and s_k ∈ S if there exist edges e'_{i,j} and e'_{i,k} ∈ E'. The function COMMON_NEIGHBOR(G', s_i, s_j) returns the set of super-nodes that are common neighbors of s_i and s_j in G'. The procedure DELETE_EDGE(E', s_i) deletes all edges in E' which have s_i as their end super-node. Initially, each vertex v_i ∈ V of G is placed in a separate super-node s_i ∈ S of G'. At each step, the algorithm finds the super-nodes s_{Index1} and s_{Index2} in S such that they are connected by an edge and have the maximum number of common neighbors. These two super-nodes are merged into a single super-node, s_{Index1Index2}, which contains all the vertices of s_{Index1} and s_{Index2}. The set CommonSet contains all the common neighbors of s_{Index1} and s_{Index2}. All edges originating from s_{Index1} or s_{Index2} in G' are deleted. New edges are added from s_{Index1Index2} to all the super-nodes in CommonSet. The above steps are repeated until there are no edges left in the graph.
The vertices contained in each super-node s_i ∈ S form a clique of the graph G.

Algorithm 8.2: Clique Partitioning.

    /* create a super-graph G'(S, E') */
    S = ∅; E' = ∅;
    for each v_i ∈ V do
        s_i = {v_i}; S = S ∪ {s_i};
    endfor
    for each e_{i,j} ∈ E do
        E' = E' ∪ {e'_{i,j}};
    endfor
    while E' ≠ ∅ do
        /* find s_Index1, s_Index2 having most common neighbors */
        MostCommons = -1;
        for each e'_{i,j} ∈ E' do
            c_{i,j} = |COMMON_NEIGHBOR(G', s_i, s_j)|;
            if c_{i,j} > MostCommons then
                MostCommons = c_{i,j};
                Index1 = i; Index2 = j;
            endif
        endfor
        CommonSet = COMMON_NEIGHBOR(G', s_Index1, s_Index2);
        /* delete all edges linking s_Index1 or s_Index2 */
        E' = DELETE_EDGE(E', s_Index1);
        E' = DELETE_EDGE(E', s_Index2);
        /* merge s_Index1 and s_Index2 into s_Index1Index2 */
        s_Index1Index2 = s_Index1 ∪ s_Index2;
        S = S - s_Index1 - s_Index2;
        S = S ∪ {s_Index1Index2};
        /* add edge from s_Index1Index2 to super-nodes in CommonSet */
        for each s_i ∈ CommonSet do
            E' = E' ∪ {e'_{i,Index1Index2}};
        endfor
    endwhile

Figure 8.10 illustrates the above algorithm. In the graph of Figure 8.10(a), V = {v_1, v_2, v_3, v_4, v_5} and E = {e_{1,3}, e_{1,4}, e_{2,3}, e_{2,5}, e_{3,4}, e_{4,5}}. Initially, each vertex is placed in a separate super-node (labeled s_1 through s_5 in Figure 8.10(b)). The three edges e'_{1,3}, e'_{1,4} and e'_{3,4} of the super-graph G' have the maximum number of common neighbors among all edges (Figure 8.10(b)). The first edge, e'_{1,3}, is selected and the following steps are carried out to yield the graph of Figure 8.10(c):

(1) s_4, the only common neighbor of s_1 and s_3, is put in CommonSet.
(2) All edges are deleted that link either super-node s_1 or s_3 (i.e., e'_{1,3}, e'_{1,4}, e'_{2,3} and e'_{3,4}).
(3) Super-nodes s_1 and s_3 are combined into a new super-node s_13.
(4) An edge is added between s_13 and each super-node in CommonSet; i.e., the edge e'_{13,4} is added.

On the next iteration, s_4 is merged into s_13 to yield the super-node s_134 (Figure 8.10(d)). Finally, s_2 and s_5 are merged into the super-node s_25 (Figure 8.10(e)). The cliques are s_134 = {v_1, v_3, v_4} and s_25 = {v_2, v_5} (Figure 8.10(f)).
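The heuristic of Algorithm 8.2 can be sketched in Python as follows. This is a minimal illustration and not the implementation of [TsSi86]: the function name `clique_partition`, the representation of super-nodes as frozensets, and the tie-breaking rule (edges are visited in sorted order and the first edge with the largest count wins) are assumptions made for the example. The tie-breaking rule matters, because the heuristic can return a different partition when equally ranked edges are visited in another order.

```python
def clique_partition(vertices, edges):
    """Greedy clique partitioning in the style of Algorithm 8.2."""
    # Each super-node is a frozenset of original vertices; adj maps a
    # super-node to the set of super-nodes it shares an edge with.
    adj = {frozenset([v]): set() for v in vertices}
    for a, b in edges:
        adj[frozenset([a])].add(frozenset([b]))
        adj[frozenset([b])].add(frozenset([a]))

    def order(s):
        return sorted(s)  # deterministic ordering of super-nodes (assumption)

    while any(adj.values()):  # repeat until no edges remain
        best, pair = -1, None
        for s in sorted(adj, key=order):
            for t in sorted(adj[s], key=order):
                if order(s) < order(t):  # visit each edge once
                    common = len(adj[s] & adj[t])
                    if common > best:
                        best, pair = common, (s, t)
        s, t = pair
        common_set = adj[s] & adj[t]
        # Delete every edge that touches s or t.
        for u in (s, t):
            for n in adj.pop(u):
                if n in adj:
                    adj[n].discard(u)
        # Merge s and t and reconnect the merged node to the common neighbors.
        merged = s | t
        adj[merged] = set(common_set)
        for n in common_set:
            adj[n].add(merged)
    return [sorted(s) for s in sorted(adj, key=order)]


# The graph of Figure 8.10(a).
edges = [(1, 3), (1, 4), (2, 3), (2, 5), (3, 4), (4, 5)]
cliques = clique_partition(range(1, 6), edges)
print(cliques)  # [[1, 3, 4], [2, 5]]
```

With this visiting order the sketch follows the same merge sequence as the worked example: s_1 and s_3 first, then s_4, then s_2 and s_5.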


In order to apply the clique-partitioning technique to the allocation problem, we first have to derive the graph model from the input description. Consider register allocation as an example. The primary goal of register allocation is to minimize the register cost by maximizing the sharing of common registers among variables. To solve the register-allocation problem, we construct a graph G = (V, E) in which every vertex v_i ∈ V uniquely represents a variable v_i and there exists an edge e_{i,j} ∈ E if and only if variables v_i and v_j can be stored in the same register (i.e., their lifetime intervals do not overlap). All the variables whose representative vertices are in a clique of G can be stored in a single register. A clique partitioning of G provides a solution for the datapath storage-allocation problem that requires a minimal number of registers. Figure 8.11 shows a solution of the register-allocation problem using the clique-partitioning algorithm.

Both functional-unit allocation and interconnection allocation can also be formulated as clique-partitioning problems. For functional-unit allocation, each graph vertex represents an operation. An edge exists between two vertices if two conditions are satisfied: (1) the two operations are scheduled into different control steps, and (2) there exists a functional unit that is capable of carrying out both operations.

Figure 8.10: Clique partitioning: (a) given graph G, (b) calculating the common neighbors for the edges of graph G', (c) super-node s_13 formed by considering edge e'_{1,3}, (d) super-node s_134 formed by considering edge e'_{13,4}, (e) super-node s_25 formed by merging s_2 and s_5, (f) resulting cliques s_134 and s_25.

A clique-partitioning solution of this graph would yield a solution for the functional-unit allocation problem. Since a functional unit is assigned to each clique, all operations whose representative vertices are in a clique are executed in the same functional unit.
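As a small illustration of the two edge conditions for functional-unit allocation, the following Python sketch builds the edge list of the compatibility graph. The operation names, the schedule and the unit library (`ops`, `unit_library`) are hypothetical, and the library is modeled simply as a list of sets of operation types that one unit type can execute.

```python
from itertools import combinations


def fu_compatibility_edges(ops, unit_library):
    """ops maps an operation name to (control_step, op_type).

    unit_library is a list of sets; each set holds the operation types
    that one kind of functional unit can execute (an assumed model).
    """
    edges = []
    for a, b in combinations(sorted(ops), 2):
        step_a, type_a = ops[a]
        step_b, type_b = ops[b]
        # Condition (1): scheduled into different control steps.
        different_steps = step_a != step_b
        # Condition (2): some unit can carry out both operations.
        shared_unit = any({type_a, type_b} <= caps for caps in unit_library)
        if different_steps and shared_unit:
            edges.append((a, b))
    return edges


# Hypothetical schedule: two additions in step 1, a subtraction and a
# multiplication in step 2; an adder/subtracter and a multiplier exist.
ops = {"o1": (1, "+"), "o2": (1, "+"), "o3": (2, "-"), "o4": (2, "*")}
unit_library = [{"+", "-"}, {"*"}]
fu_edges = fu_compatibility_edges(ops, unit_library)
print(fu_edges)  # [('o1', 'o3'), ('o2', 'o3')]
```

Here o1 and o2 cannot share a unit because they execute in the same control step, and o4 has no edge because no unit in the assumed library performs both a multiplication and an addition or subtraction.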


For interconnection-unit allocation, each vertex corresponds to a connection between two units, whereas an edge links two vertices if the two corresponding connections are not used concurrently in any control step. A clique-partitioning solution of such a graph implies partitioning of connections into buses or multiplexers. In other words, all connections whose representative vertices are in the same clique use the same bus or multiplexer.

Although the clique-partitioning method when applied to storage allocation can minimize the storage requirements, it totally ignores the interdependence between storage and interconnection allocation. Paulin and Knight [PaKn89] extend the previous method by augmenting the graph edges with weights that reflect the impact on interconnection complexity due to register sharing among variables. An edge is given a higher weight if sharing of a register by the two variables corresponding to the edge's two end vertices reduces the interconnection cost. On the other hand, an edge is given a lower weight if the sharing causes an increase in the interconnection cost. The modified algorithm prefers cliques with heavier edges. Hence, variables that share a common register are more likely to reduce the interconnection cost.






Figure 8.11: Register allocation using clique partitioning: (a) a scheduled DFG, (b) lifetime intervals of variables, (c) the graph model for register allocation, (d) a clique-partitioning solution.
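The graph model for register allocation can be derived mechanically from the lifetime intervals. The sketch below is illustrative only: the lifetimes are made up rather than taken from Figure 8.11, and it assumes that a variable whose lifetime starts exactly when another ends may share that register, which is the convention implied by the comparison used later in the left-edge algorithm.

```python
from itertools import combinations


def register_compatibility_edges(lifetimes):
    """lifetimes maps a variable name to (start, end).

    Two variables are joined by an edge when their lifetime intervals
    do not overlap, i.e. they can be stored in the same register.
    """
    edges = []
    for a, b in combinations(sorted(lifetimes), 2):
        (start_a, end_a), (start_b, end_b) = lifetimes[a], lifetimes[b]
        if end_a <= start_b or end_b <= start_a:
            edges.append((a, b))
    return edges


# Hypothetical lifetimes, expressed in control-step boundaries.
lifetimes = {"a": (0, 2), "b": (1, 3), "c": (2, 4), "d": (3, 5)}
reg_edges = register_compatibility_edges(lifetimes)
print(reg_edges)  # [('a', 'c'), ('a', 'd'), ('b', 'd')]
```

The resulting edge list is exactly the input that a clique-partitioning procedure needs; every clique it finds corresponds to one register.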

The left-edge algorithm [HaSt71] is well known for its application in channel-routing tools for physical-design automation. The goal of the channel-routing problem is to minimize the number of tracks used to connect points on the channel boundary. Two points on the channel boundary are connected with one horizontal (i.e., parallel to the channel) and two vertical (i.e., orthogonal to the channel) wire segments. Since the channel width depends on the number of horizontal tracks used, the channel-routing algorithms try to pack the horizontal segments into as few tracks as possible. Kurdahi and Parker [KuPa87] apply the left-edge algorithm to solve the register-allocation problem, in which variable lifetime intervals correspond to horizontal wire segments and registers to wiring tracks.

The input to the left-edge algorithm is a list of variables, L. A lifetime interval is associated with each variable. The algorithm makes several passes over the list of variables until all variables have been assigned to registers. Essentially, in each pass the algorithm allocates a new register and packs into it as many variables with non-overlapping lifetimes as possible, following the channel-routing analogy of packing horizontal segments into as few tracks as possible.

Algorithm 8.3 describes register allocation using the left-edge algorithm. If there are n variables in the behavioral description, we define L to be the list of all variables v_i, 1 ≤ i ≤ n. Let the 2-tuple <Start(v), End(v)> represent the lifetime interval of a variable v, where Start(v) and End(v) are respectively the start and end times of its lifetime interval. The procedure SORT(L) sorts the variables in L in ascending order with their start times, Start(v), as the primary key and in descending order with their end times, End(v), as the secondary key. The procedure DELETE(L, v) deletes the variable v from list L, FIRST(L) returns the first variable in the sorted list L and NEXT(L, v) returns the variable following v in list L. The array MAP keeps track of the registers assigned to each variable. The value of reg_index represents the index of the register being allocated in each pass. The end time of the interval of the most recently assigned variable in that pass is contained in last.

Initially, the variables are not assigned to any of the registers. During each pass over L, variables are assigned to a new register, r_reg_index. The first variable from the sorted list whose lifetime does not overlap with the lifetime of any other variable assigned to r_reg_index is assigned to the same register. When a variable is assigned to a register, the register is entered in the array MAP for that variable. On termination of the algorithm, the array MAP contains the registers assigned to all the variables and reg_index represents the total number of registers allocated.

Algorithm 8.3: Register Allocation using Left-Edge Algorithm.

    for all v ∈ L do MAP[v] = 0; endfor
    SORT(L);
    reg_index = 0;
    while L ≠ ∅ do
        reg_index = reg_index + 1;
        curr_var = FIRST(L);
        last = 0;
        while curr_var ≠ null do
            if Start(curr_var) ≥ last then
                MAP[curr_var] = r_reg_index;
                last = End(curr_var);
                temp_var = curr_var;
                curr_var = NEXT(L, curr_var);
                DELETE(L, temp_var);
            else
                curr_var = NEXT(L, curr_var);
            endif
        endwhile
    endwhile

Figure 8.12(a) depicts the sorted list of the lifetime intervals of the variables of the DFG in Figure 8.11(a). Note that variables v_1 and v_2 in Figure 8.11(a) are divided into two variables each (v_1, v_1' and v_2, v_2') in order to obtain a better packing density. Figure 8.12(b) illustrates how the left-edge algorithm works on the example. Starting with a new empty register r_1, the first variable in the sorted list, v_1, is put into r_1. Traveling down the list, no variables can be packed into r_1 before v_8 is encountered. After packing v_8 into r_1, the next packable variable down the list is v_1'. No more variables can be assigned to r_1 without overlapping variable lifetimes. Hence the algorithm allocates a new register (r_2) and starts from the beginning of the list again. The sorted list now has three fewer variables than it had in the beginning (i.e., v_1, v_8 and v_1' have been removed). The list becomes empty after five registers have been allocated.

Unlike the clique-partitioning problem, which is NP-complete, the left-edge algorithm has a polynomial time complexity. Moreover, this algorithm allocates the minimum number of registers [KuPa87]. However, it cannot take into account the impact of register allocation on the interconnection cost, as can the weighted version of the clique-partitioning algorithm.
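The passes of the left-edge algorithm translate almost line for line into Python. The sketch below follows Algorithm 8.3 but replaces the linked-list operations FIRST, NEXT and DELETE with a list that is rebuilt after every pass. The function name `left_edge` and the example lifetimes are hypothetical, and start times are assumed to be non-negative so that initializing `last` to 0 is safe.

```python
def left_edge(lifetimes):
    """lifetimes maps a variable name to (start, end).

    Returns a dict from variable name to a 1-based register index.
    """
    # SORT(L): ascending start time, then descending end time.
    remaining = sorted(lifetimes,
                       key=lambda v: (lifetimes[v][0], -lifetimes[v][1]))
    mapping = {}
    reg_index = 0
    while remaining:
        reg_index += 1          # allocate a new register for this pass
        last = 0                # end time of the latest variable packed
        leftover = []
        for v in remaining:
            start, end = lifetimes[v]
            if start >= last:   # does not overlap what the register holds
                mapping[v] = reg_index
                last = end
            else:
                leftover.append(v)
        remaining = leftover    # variables still waiting for a register
    return mapping


# Hypothetical lifetimes (not those of Figure 8.12).
lifetimes = {"a": (0, 2), "b": (0, 4), "c": (2, 5), "d": (4, 6), "e": (5, 7)}
allocation = left_edge(lifetimes)
print(allocation)  # {'b': 1, 'd': 1, 'a': 2, 'c': 2, 'e': 2}
```

In this example at most two lifetimes overlap at any time, and the sketch uses two registers, which agrees with the minimality property stated above.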











Figure 8.12: Register allocation using the left-edge algorithm: (a) sorted variable lifetime intervals, (b) five-register allocation result.

Both the register and functional-unit allocation problems can be transformed into a weighted bipartite-matching problem [HCLH90]. Unlike the left-edge algorithm, which binds variables sequentially to one register at a time, the bipartite-matching algorithm binds multiple variables to multiple registers simultaneously. Moreover, it takes interconnection cost into account during allocation of registers and functional units.

Algorithm 8.4 describes the weighted bipartite-matching method. Let L be the list of all variables as defined in the description of the left-edge algorithm. The number of required registers, num_reg, is the maximum density of the lifetime intervals (i.e., the maximum number of overlapping variable lifetimes). Let R be the set of nodes representing the registers {r_1, r_2, ..., r_num_reg} into which the set of variables, V, will be mapped. The set of variables is partitioned into clusters of variables with mutually overlapping lifetimes. The function OVERLAP(Cluster_i, v) returns the value true if either Cluster_i is empty or if the lifetime of variable v overlaps with the lifetimes of all the variables already assigned to Cluster_i. The function BUILD_GRAPH(R, V) returns the set of edges E containing edges between nodes of the two sets R, representing registers, and V, representing variables in a particular cluster. An edge e_{i,j} ∈ E represents a feasible assignment of variable v_j to register r_i if and only if the lifetime of v_j does not overlap with the lifetime of any variable already assigned to r_i. The procedure MATCHING(G(R ∪ V, E)) finds the subset E' ⊆ E, which represents a bipartite-matching solution of the graph G. Each edge e_{i,j} in E' represents the assignment of a variable v_j ∈ V to register r_i ∈ R. No two edges in E' share a common end-node (i.e., no two variables in the same cluster can be assigned to the same register). The array MAP, the procedures SORT and DELETE, and the function FIRST are identical to those of the left-edge algorithm.

Algorithm 8.4: Register Allocation using Bipartite Matching.

    for all v ∈ L do MAP[v] = 0; endfor
    SORT(L);
    /* divide variables into clusters */
    clus_num = 0;
    while L ≠ ∅ do
        clus_num = clus_num + 1;
        Cluster_clus_num = ∅;
        while (L ≠ ∅) and OVERLAP(Cluster_clus_num, FIRST(L)) do
            Cluster_clus_num = Cluster_clus_num ∪ {FIRST(L)};
            L = DELETE(L, FIRST(L));
        endwhile
    endwhile
    /* allocate registers for one cluster of variables at a time */
    for k = 1 to clus_num do
        V = Cluster_k;
        E = BUILD_GRAPH(R, V);
        E' = MATCHING(G(R ∪ V, E));
        for each e_{i,j} ∈ E', where v_j ∈ V and r_i ∈ R do
            MAP[v_j] = r_i;
        endfor
    endfor

After sorting the list of variables according to their lifetime intervals in the same way as the left-edge algorithm, this algorithm divides them into clusters of variables. The lifetime of each variable in a cluster overlaps with the lifetimes of all the other variables in the same cluster (Figure 8.13(a)). In the figure, the maximum number of overlapping lifetimes is five. Thus, the set of registers, R, will have five registers (r_1, r_2, r_3, r_4 and r_5) into which all the variables will be mapped. A cluster of variables is assigned to the set of registers simultaneously. For example, the first cluster of five variables, v_1, v_10, v_4, v_6 and v_2, shown in Figure 8.13(b), has been assigned to the registers r_1, r_2, r_3, r_4 and r_5 respectively.

The algorithm then tries to assign the second cluster of three variables, v_3, v_5 and v_7, to the registers. A variable can be assigned to a register only if its lifetime does not overlap with the lifetime of any variable already assigned to that register. In Figure 8.13(b), each graph edge represents a possible variable-to-register assignment. As in the clique-partitioning algorithm, weights can be associated with the edges. An edge e_{i,j} is given a higher weight if the assignment of v_j to r_i results in a reduced interconnection cost. For example, let variable v_m be bound to register r_n. If another variable v_k is to be used by the same functional units that also use variable v_m, then since the two variables can share the interconnections, it is desirable that v_k also be assigned to r_n.

In a bipartite graph, the node set is partitioned into two disjoint subsets and every edge connects two nodes in different subsets. The graph depicted in Figure 8.13(b) is bipartite. It has two sets, the set of registers, R = {r_1, r_2, r_3, r_4, r_5}, and the set of variables, V = {v_3, v_5, v_7}. The graph also has a set of edges E = {e_{3,5}, e_{3,7}, e_{4,5}, e_{4,7}, e_{5,3}, e_{5,5}, e_{5,7}} returned by the function BUILD_GRAPH. The problem of matching each variable to a register is equivalent to the classic job-assignment problem. The largest subset of edges that do not have any common end-nodes is defined as the maximum matching of the graph. The maximum edge-weight matching of a graph is the maximum matching with the largest sum of the weights of its edges. A polynomial-time algorithm for obtaining the maximum-weight matching is presented in [PaSt82]. The set E' indicates the assignments of variables to registers as determined by the maximum-matching algorithm in function MATCHING. E' = {e_{5,3}, e_{3,5}, e_{4,7}} is indicated in the graph of Figure 8.13(b) with bold lines. After binding the second cluster of variables to the registers according to the matching solution, namely, v_3 to r_5, v_5 to r_3 and v_7 to r_4, the algorithm proceeds to allocate the third cluster of variables, v_8, v_9 and v_11, and so on. The final allocation of variables to registers is given in Figure 8.13(c).

Figure 8.13: Weighted bipartite-matching for register allocation: (a) sorted lifetime intervals with clusters, (b) bipartite graph for binding of variables in Cluster_2 after Cluster_1 has been assigned to the registers, (c) final allocation of variables to registers.
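The cluster-by-cluster binding of Algorithm 8.4 can be sketched as follows. Two caveats apply. First, the MATCHING step is implemented here by brute-force enumeration of register choices, which is only a stand-in for the polynomial-time matching algorithm of [PaSt82] and is practical for small clusters only. Second, the function names, the `weight` callback and the example lifetimes are hypothetical; the edge weights that reflect interconnection cost would have to come from the datapath model.

```python
from itertools import permutations


def overlaps(a, b):
    # Half-open intervals [start, end): touching intervals do not overlap.
    return a[0] < b[1] and b[0] < a[1]


def max_density(lifetimes):
    # Sweep over start (+1) and end (-1) events; ends sort first at ties.
    events = sorted([(s, 1) for s, _ in lifetimes.values()] +
                    [(e, -1) for _, e in lifetimes.values()])
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


def clusters_of(lifetimes):
    order = sorted(lifetimes,
                   key=lambda v: (lifetimes[v][0], -lifetimes[v][1]))
    clusters = []
    while order:
        cluster = []
        # OVERLAP: true for an empty cluster or if the next variable
        # overlaps every variable already in the cluster.
        while order and all(overlaps(lifetimes[order[0]], lifetimes[u])
                            for u in cluster):
            cluster.append(order.pop(0))
        clusters.append(cluster)
    return clusters


def bipartite_allocate(lifetimes, weight=lambda reg, var: 1):
    regs = {r: [] for r in range(1, max_density(lifetimes) + 1)}
    mapping = {}
    for cluster in clusters_of(lifetimes):
        best, best_score = None, None
        # Brute-force stand-in for MATCHING: try every way of giving the
        # cluster's variables distinct registers and keep the heaviest.
        for choice in permutations(regs, len(cluster)):
            pairs = list(zip(choice, cluster))
            feasible = all(not overlaps(lifetimes[v], lifetimes[u])
                           for r, v in pairs for u in regs[r])
            if feasible:
                score = sum(weight(r, v) for r, v in pairs)
                if best is None or score > best_score:
                    best, best_score = pairs, score
        if best is None:
            raise ValueError("no feasible assignment for %r" % (cluster,))
        for r, v in best:
            regs[r].append(v)
            mapping[v] = r
    return mapping


lifetimes = {"a": (0, 2), "b": (0, 4), "c": (2, 5), "d": (4, 6), "e": (5, 7)}
uniform = bipartite_allocate(lifetimes)
# Hypothetical weight that favors placing variable b in register 2.
biased = bipartite_allocate(
    lifetimes, weight=lambda r, v: 2 if (r, v) == (2, "b") else 1)
print(uniform)  # {'b': 1, 'a': 2, 'c': 2, 'd': 1, 'e': 2}
print(biased)   # {'b': 2, 'a': 1, 'c': 1, 'd': 2, 'e': 1}
```

Both runs use two registers, the maximum density of the example lifetimes; the weights only change which of the equally small allocations is chosen.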

The matching algorithm, like the left-edge algorithm, allocates a minimum number of registers. It also partially takes into consideration the impact of register allocation on interconnection allocation, since it can associate weights with the edges.

Given a datapath synthesized by constructive or decomposition methods, its quality can be improved by reallocation. As an example, consider functional-unit reallocation. It is possible to reduce the interconnection cost by just swapping the functional-unit assignments for a pair of operations. For instance, if we start with the datapath shown in Figure 8.8(c), swapping the functional-unit assignments for operations o_3 and o_4 will reduce the interconnection cost by four 2-to-1 multiplexers (Figure 8.8(d)). Changing some variable-to-register assignments can be beneficial too. In Figure 8.8(b), if we move variable g from register r_2 to register r_1 and h from r_3 to r_4, we get an improved datapath with two fewer multiplexers (Figure 8.8(c)).

The main issues in the iterative refinement approach are the types of modifications to be applied to a datapath, the selection of a modification type during an iteration and the termination criteria for the refinement process. The most straightforward approach could be a simple assignment exchange. In this approach, the modification to a datapath is limited to a swapping of two assignments (i.e., variable pairs or operation pairs). Assume that only operation swapping is used for the iterative refinement. The pairwise exchange algorithm performs a series of modifications to the datapath in order to decrease the datapath cost. First, all possible swappings of operation assignments scheduled into the same control step are evaluated in terms of the gain in the datapath cost due to a change in the interconnections. Then, the swapping that results in the largest gain is chosen and the datapath is updated to reflect the swapped operations. This process is repeated until no amount of swapping results in a positive gain (i.e., a further reduction in the datapath cost).

Algorithm 8.5 describes the pairwise exchange method. Let DP_current represent the current datapath structure and DP_work represent a temporary datapath created to evaluate the cost of each operation assignment swap. The function COST(DP) evaluates the cost of the datapath DP. The datapath costs of DP_current and DP_work are represented by C_current and C_work. The procedure SWAP(DP, o_i, o_j) exchanges the assignments for operations o_i and o_j of the same type and updates the datapath DP accordingly. In each iteration of the innermost loop, CurrentGain represents the reduction in datapath cost due to the swapping of operations in that iteration. BestGain keeps track of the largest reduction in the cost attainable by any single swapping of operations evaluated so far in the current iteration.

Algorithm 8.5: Pairwise Exchange.

    repeat
        BestGain = -∞;
        C_current = COST(DP_current);
        for all control steps, s do
            for each o_i, o_j of the same type scheduled into s, i ≠ j do
                DP_work = SWAP(DP_current, o_i, o_j);
                C_work = COST(DP_work);
                CurrentGain = C_current - C_work;
                if CurrentGain > BestGain then
                    BestGain = CurrentGain;
                    BestOp1 = o_i; BestOp2 = o_j;
                endif
            endfor
        endfor
        if BestGain > 0 then
            DP_current = SWAP(DP_current, BestOp1, BestOp2);
        endif
    until BestGain ≤ 0

This approach has two weaknesses. First, using a simple pairwise swapping alone may be inadequate for exploring all possible refinement opportunities. For instance, no amount of variable swapping can change the datapath of Figure 8.8(b) to that of Figure 8.8(d). Second, the greedy strategy of always going for the most profitable modification is likely to lead the refinement process into a local minimum. While a probabilistic method, such as simulated annealing [KiGV83], can be used to cope with the second problem at the expense of more computation time, the first one requires a more sophisticated solution, such as swapping several assignments at the same time.

Suppose operation o_i has been assigned to functional unit fu_j and one of its input variables has been bound to register r_k. The removal of o_i from fu_j will not eliminate the interconnection from r_k to fu_j unless no other operation that has been previously assigned to fu_j has its input variables assigned to r_k. Clearly, the iterative refinement process has to approach the problem at a coarser level by considering multiple objects simultaneously. We must take into account the relationship between entities of different types. For example, the gain obtained in operation reallocation may be much higher if its input variables are also reallocated simultaneously. The strategy of reallocating a group of different types of entities can be as simple as a greedy constructive algorithm (Section 8.4) or as sophisticated as a branch-and-bound search (e.g., STAR [TsHs90]).

In this chapter, we described the datapath allocation problem, which consists of four basic subtasks: unit selection, functional-unit binding, storage binding and interconnection binding. We outlined the features of some realistic architectures and their impact on the interconnection complexity for allocation. We described the basic techniques for allocation using a simple model that assumes only a straight-line code behavioral description and a simple point-to-point interconnection topology. We discussed the interdependencies among the subtasks that can be performed in an interleaved manner using a greedy constructive approach, or sequentially, using a decomposition approach. We applied three graph-theoretical algorithms to the datapath allocation problem: clique partitioning, the left-edge algorithm and weighted bipartite matching. We also showed how to iteratively refine a datapath by a selective, controlled reallocation process.

The greedy constructive approach (Algorithm 8.1) is the simplest amongst all the approaches. It is both easy to implement and computationally inexpensive. Unfortunately, it is liable to produce inferior designs. The three graph-theoretical approaches solve the allocation tasks separately. The clique-partitioning approach (Algorithm 8.2) is applicable to storage, functional and interconnection unit allocation. The left-edge algorithm (Algorithm 8.3) is well suited for storage allocation. The bipartite-matching approach (Algorithm 8.4) is applicable to storage and functional-unit allocation. Although they all run in polynomial time, only the left-edge and the bipartite-matching algorithms guarantee optimal usage of registers, while only the clique-partitioning and the bipartite-matching algorithms are able to take into account the impact on interconnection during storage allocation. The iterative refinement approach (Algorithm 8.5) achieves a high-quality design at the expense of excessive computation time. The selection and reallocation of a set of behavioral entities have a significant impact on the algorithm's convergence as well as the datapath cost.

Future work in datapath allocation will need to focus on improving the allocation algorithms in several directions. First, the allocation algorithms can be integrated with the scheduler in order to take advantage of the coupling between scheduling and allocation. As we have pointed out, the number of control steps and the required number of functional units cannot accurately reflect the design quality. A fast allocator can quickly provide the scheduler with more information (e.g., storage sharing between operands of different operations, interconnection cost and data-transfer delays) than just these two numbers. Consequently, the scheduler will be able to make more accurate decisions. Second, the algorithms must use cost functions based on physical design characteristics. This will give us more realistic cost functions that closely match the actual design. Finally, allocation of more complex datapath structures must be incorporated. For example, the variables and arrays in the behavioral description could be partitioned into memories, a task that is complicated by the fact that memory accesses may take several clock cycles.
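To make the pairwise exchange method of Algorithm 8.5 concrete, the following Python sketch applies it to a deliberately simple datapath model. Everything about the model is an assumption made for illustration: a datapath is reduced to a binding of operations to functional units, each operation records its control step, type and input registers, and the cost function `interconnect_cost` merely counts distinct register-to-unit connections instead of evaluating multiplexers or buses.

```python
from itertools import combinations


def interconnect_cost(binding, ops):
    """Hypothetical COST: number of distinct register-to-unit connections."""
    return len({(src, binding[name])
                for name, op in ops.items() for src in op["inputs"]})


def pairwise_exchange(binding, ops, cost=interconnect_cost):
    binding = dict(binding)                 # DP_current
    while True:
        best_gain, best_pair = float("-inf"), None
        current = cost(binding, ops)
        for a, b in combinations(sorted(ops), 2):
            same_step = ops[a]["step"] == ops[b]["step"]
            same_type = ops[a]["type"] == ops[b]["type"]
            if same_step and same_type:
                trial = dict(binding)       # DP_work
                trial[a], trial[b] = binding[b], binding[a]
                gain = current - cost(trial, ops)
                if gain > best_gain:
                    best_gain, best_pair = gain, (a, b)
        if best_gain <= 0:                  # no swap improves the cost
            return binding
        a, b = best_pair
        binding[a], binding[b] = binding[b], binding[a]


# Hypothetical schedule: two additions per control step, two adders.
ops = {
    "o1": {"step": 1, "type": "+", "inputs": ("r1", "r2")},
    "o2": {"step": 1, "type": "+", "inputs": ("r3", "r4")},
    "o3": {"step": 2, "type": "+", "inputs": ("r1", "r2")},
    "o4": {"step": 2, "type": "+", "inputs": ("r3", "r4")},
}
initial = {"o1": "fu1", "o2": "fu2", "o3": "fu2", "o4": "fu1"}
refined = pairwise_exchange(initial, ops)
print(interconnect_cost(initial, ops), interconnect_cost(refined, ops))  # 8 4
```

In the initial binding every register feeds both adders, giving eight connections. A single swap in control step 1 makes each adder read from only two registers, after which no further swap yields a positive gain and the loop terminates.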

1. Using components from a standard datapath library, compare the delay times for the following datapath components: (a) a 32-bit 4-to-1 multiplexer, (b) a 32-bit 2-to-1 multiplexer, (c) a 32-bit ALU performing addition, (d) a 32-bit ALU performing a logic OR, (e) a 32-bit floating-point multiplier, (f) a 32-bit fixed-point multiplier, and (g) a 32-bit 16-word register file performing a read.

2. Suppose the result of operation o_i is needed by operation o_j and the two operations have been scheduled into two consecutive control steps. In a shared-bus architecture such as that of Figure 8.6, o_j will be reading its input operands before o_i has written its output operand. What changes are required in the target architecture to prevent o_j from getting the wrong values?

3. Extend the design in Figure 8.7 to support chaining of functional units. Discuss the implications of this change on allocation algorithms.

4. *Extend the left-edge algorithm [KuPa87] to handle inputs with conditional branches. Does it still allocate a minimal number of registers?

5. Prove or disprove that the bipartite-matching method [HCLH90] uses the same number of registers as the left-edge algorithm does in a straight-line code description.

6. Extend the weighted bipartite-matching algorithm to handle inputs with conditional branches. Does it still allocate a minimal number of registers?

7. Show that every multiplexer can be replaced by a bus and vice versa.

8. In a bus-based interconnection unit, assume that there is one level of tri-state buffers from the inputs to the buses and one level of multiplexers from the buses to the outputs. Redesign the interconnection unit of Figure 8.6 (which, by coincidence, uses no multiplexers at all), using the same number of buses so that the data-transfer capability of our five-transfer example is preserved and the number of tri-state buffers is minimized (at the expense of more multiplexers, of course).

9. *Some functional units, such as adders, allow their two input operands to be swapped (e.g., a + b = b + a). Suppose both the functional-unit and storage allocation have been done. Design an interconnection-allocation algorithm that takes advantage of this property.

10. Using a component library and a timing simulator, measure the impact of bus loading on data-transfer delay. Draw a curve showing the delay as a function of the number of components attached to the bus.

11. Show an example similar to that of Figure 8.8, where swapping of a pair of variables reduces the worst-case delay for the data transfer. Assume that latches exist at both the input and output ports of all functional units.

12. Given a partially allocated datapath in which both the functional-unit allocation and the register allocation have been done, design an interconnection-allocation algorithm that minimizes the maximal bus load for a non-shared bus-based architecture.

13. *Design an interconnection-first algorithm for datapath allocation targeted towards a bus-based architecture. That is, assign data transfers to buses before register and functional-unit allocation. Compare this algorithm with the interconnection-last algorithms. Does this approach simplify or complicate the other two tasks?