THESIS Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE IN ENGINEERING
To Krishna, for being my quest. . . To Amma and Daddy, for showing me the way. . .
Acknowledgments
I'd like to thank my advisor, Dr. Jacob Abraham, for his invaluable support and guidance through the course of this work. His novel ideas, infectious enthusiasm and intellectually stimulating discussions kept me motivated and encouraged through the entire course of my graduate studies. Thank you, Sir, for your firm belief in me. It kept me going in the most trying times. I'd also like to thank my colleague and fellow PhD student, Vinod Viswanath, for his support and assistance through my Master's. His experience, insight, resourcefulness, skills and alacrity have been a priceless source of inspiration and help in obtaining this degree. Without his contribution, I don't imagine I could have got this far. I'd like to thank Linda, Andrew, Shirley and Ruth for their promptness and efficiency in matters that required their attention. I'd also like to thank my lab-mates for their co-operation. I'd like to thank my friends Siddarth and Kunal for bringing a lot of joy into my life in the U.S. Lastly, I'd like to thank my parents and sister for making me who I am.
We present a novel technique to formally verify arithmetic circuit designs at the Register Transfer Level (RT-Level). Our technique translates circuits in Verilog RTL to Term Rewriting Systems (TRS). We verify the target design against a simple, correct reference design with the same functionality: we translate the two designs into TRSs and, using a theory of equivalence of two TRSs that we introduce, prove the correctness of the target design with respect to the reference design. Our tool, Verifire, automates the entire technique. We demonstrate the applicability of this technique on adder designs and illustrate its power when applied to multiplier verification. We show a detailed proof of correctness, as output by our tool, of a Wallace Tree multiplier. We also show the extension of our tool to modifications of standard multipliers, with BISMUL, a modified Booth multiplier.
Table of Contents
Acknowledgments
Abstract
List of Tables
List of Figures
Chapter 1. Introduction
  1.1 Motivation and Prior Work
  1.2 Our Approach
  1.3 Organization of Thesis
Chapter 2. An Overview of our Verification Methodology
  2.1 Choice of Representation
Chapter 3. Term Rewriting Systems
  3.0.1 Example Term Rewriting System
Chapter 4. Equivalence of Term Rewriting Systems
  4.1 Definition of theory of equivalence of TRSs
  4.2 Alternative Definition and Proof for Equivalence of TRSs
  4.3 Computing Comparison points
Chapter 5. Verifire: A fully automated proof generator
Chapter 6. Arithmetic Circuit Verification
  6.1 Adders
  6.2 Shifters and Comparators
Chapter 7. Multiplier Verification
  7.1 Booth Multiplier
  7.2 BISMUL
  7.3 Wallace Tree Multiplier
Chapter 8. Results and Discussion
  8.1 Results
  8.2 Discussion
Chapter 9. Conclusions
  9.1 Future Work
Appendices
Appendix A. An ACL2 Implementation of our Technique
  A.1 Project Description
  A.2 Verification of the 74181 ALU in ACL2
    A.2.1 Using ACL2
    A.2.2 The 74181 ALU
  A.3 Applying the technique to 16 bit adder operation of 74181 in ACL2
  A.4 Verification of a RISC pipeline using our technique
Appendix B. Verilog Code for the Shift-and-Add Multiplier
Appendix C. Verilog Code for the Booth Multiplier
Appendix D. Verilog Code for the BISMUL Multiplier
Bibliography
Vita
List of Tables
Partial product terms of the Booth multiplier.
The partial product terms in a BISMUL.
The inputs of each PPSEL.
Comparison of execution times of Verifire against two commercial equivalence checkers for a Booth multiplier of varying sizes. In each case the golden model was a shift-and-add multiplier of the corresponding size.
8.2 Comparison of execution times of Verifire against two commercial equivalence checkers for a Wallace Tree multiplier of varying sizes. In each case the golden model was a shift-and-add multiplier of the corresponding size.
8.3 Comparison of execution times of Verifire against one commercial equivalence checker assisted by manual comparison points. Results are shown for both Booth and Wallace Tree multipliers.
List of Figures
2.1 Relevant subset of allowed Verilog constructs. Verilog key words are in bold.
6.1 Proof of correctness of the Carry Lookahead adder compared against the Ripple Carry Adder. represents terms of the RCA. represents terms of the CLA. The variable within is the observable variable updated after a set of rewrites. represents the expression equivalence between the observable terms of the two systems at each comparison point. The rewriting and the expression equivalence form the two engines of the Vprover.
7.1 Architecture of a Booth multiplier.
7.2 Proof of correctness of the Booth multiplier compared against the Shift&Add multiplier. represents terms of the Shift&Add multiplier. represents terms of the Booth multiplier. R represents the rules of the Shift&Add multiplier at every stage (Rule x and Rule y). R represents the corresponding Booth multiplier rules (Rule a ... Rule h). The variable within is the observable variable updated after a set of rewrites. Here it is product. represents the expression equivalence between the observable terms of the two systems at each comparison point. The rewriting and the expression equivalence form the two engines of the Vprover.
7.3 Architecture of a BISMUL.
7.4 Architecture of a -bit Wallace Tree Multiplier.
Chapter 1 Introduction
Verification of large designs at the Register Transfer Level (RT-Level) is a widely studied problem. Verification of arithmetic circuits, especially integer multipliers, presents an interesting nexus of challenge and opportunity. State-of-the-art verification techniques cannot verify these large circuits. In this work, we present a technique to formally verify large arithmetic circuit designs at the RT-level. We propose and describe the theory and technique that performs equivalence checking between two RT-level designs.
only at the gate level. Though they can handle many generic circuits at this level, they cannot verify large multipliers. Also, the two designs being compared have to be very similar for reasonable performance; structurally dissimilar designs are hard to verify against each other. There have been some efforts to verify large arithmetic circuits like multipliers using theorem provers and proof checkers [2], [8], [19], [26]. These efforts are few in number, since multiplier verification using a theorem prover is a hard problem: considerable user expertise and ingenuity are required to prove the lemmas pertinent to the design [28]. Even if the multiplier design builds on designs (like adders and shifters) whose lemmas already exist in the rule database, the lemmas corresponding to the multiplier itself can be very intricate and numerous. If, however, the multiplier is built without any of these infrastructural lemmas in the rule database, then the task of generating and proving lemmas is extremely complicated [20]. Additionally, theorem provers do not handle Verilog designs directly. Although there are translators that convert Verilog to ACL2, a large number of infrastructural lemmas for handling the translated RTL need to be written and proved [20]. Kapur et al. provide a methodology for specifying and verifying a family of parameterized multiplier circuits within the framework of the Rewrite Rule Laboratory (RRL) theorem prover [8], [18]. These circuits, however, are high-level functional abstractions and are not close to the implementation level. Also, the generality of these circuits is limited to their parametric size variations, so the technique is not suitable for a generic design.
We split the space of multiplier designs into standard (base) designs and modified (optimized) designs. Standard designs are widely used multiplier designs like the Booth, Wallace Tree and Array multipliers. We prove the optimized designs equivalent to the standard designs they modify, and prove the standard designs equivalent to a simple golden Shift-and-Add multiplier. We demonstrate the application of the technique to adders. In order to illustrate the actual power of our technique, we present the proof of correctness of non-trivial examples, a
Shift-and-Add multiplier. We also show the intuition for proving BISMUL [30], an optimized Booth multiplier, correct against the Booth multiplier. We also compare our approach with existing Boolean equivalence checkers. All of the analysis performed by the tool is done on terms composed of RTL operators (e.g. bitwise-and, left-shift, etc.) as opposed to the Boolean netlist level, the level at which equivalence checkers operate. Terms are more concise and more efficient to manipulate than netlists. The potential downside is the incompleteness of the analysis, but we have found that with proper decomposition, the requisite analysis is reasonable to implement.
technique. We present our tool Verifire in Chapter 5. In Chapter 6, we show the application of our technique and the efficiency of our tool in the domain of (integer) arithmetic circuits; this chapter also gives an illustrative proof of correctness for Carry Lookahead Adders. In Chapter 7, we provide detailed correctness proofs for the Booth and Wallace Tree multipliers and outline the proof for BISMUL. In Chapter 8, we provide the results of comparing our tool against commercial equivalence checkers for large multiplier designs, and discuss the merits and demerits of our technique in relation to other state-of-the-art techniques. Finally, conclusions and future work are presented in Chapter 9.
We present an RT-level equivalence checking technique for large arithmetic circuits. The golden design is a simple RTL design with the same functionality as the arithmetic circuit to be verified (the revised design). For instance, the ripple carry adder serves as the golden design for the verification of all target adder designs. Our proposed technique translates the golden and the revised RTL designs into a well-known intermediate formalism, Term Rewriting Systems (TRS) [21]. We treat Verilog as a programming language with deterministic semantics and model Verilog program transformations by term rewriting. We have created a framework for this translation, which we present here. We have developed the theory and methodology for establishing the equivalence of TRSs; the two RTL designs are checked for equivalence by checking their corresponding TRSs for equivalence. The notion of equivalence used is input/output equivalence. Our verification methodology is as follows. We translate the target design that has to be verified into a TRS. We translate a reference design that performs the same function as the target design, and is known to be correct, into its corresponding TRS. We prove that the two TRSs are equivalent by using our notion of observation equivalence. This technique can be used to verify very large multiplier designs, since it operates on RTL-level terms rather than Boolean netlists and therefore scales well with the size of the design.
can be interpreted and expressed as corresponding results in the representation. In the past, TRSs have been used as an intermediate representation providing a theorem-proving framework for hardware verification [27]. We also chose TRSs as an abstraction because their expressive framework allows convenient hierarchical representations. Since our method translates Verilog programs that have a hierarchical structure, the mapping between the two domains is very intuitive. Also, TRSs lend themselves to accurate and detailed modeling of the behavior of hardware systems, by allowing the modeling of concurrency and nondeterminism. The relation between Hardware Description Languages and TRSs is well established with the development of TRAC [13]. Our technique involves the reverse mapping between these two behavioral description systems: while the hardware synthesized by TRAC is used to build correct designs, our aim is to verify existing designs. We have identified a subset of the Verilog HDL that we can translate into TRSs. This subset is synthesizable by commercial tools; we follow the IEEE draft standard for Synthesizable RTL Verilog/VHDL [11]. The circuit being translated needs to conform to this subset of Verilog. We provide a grammar for this synthesizable subset of Verilog in Figure 2.1. The framework for translating Verilog designs into TRSs is detailed in the next section.
module definition ::= module module name [parameter list] declaration list statement list endmodule
declaration list ::= declaration | declaration declaration list
declaration ::= variable type identifier list [:= expression]
variable type ::= input | output | wire | reg
statement list ::= statement | statement statement list
statement ::= always statement | if statement | module call statement | variable assignment statement
Figure 2.1: Relevant subset of allowed Verilog constructs. Verilog key words are in bold.
We present a brief introduction to TRSs in this section. A Term Rewriting System can be represented by a pair $(T, R)$, where $T$ is a set of terms and $R$ is a set of rewrite rules $l \rightarrow r$, with $l, r \in T$. A TRS is terminating if no infinite sequence of rewrites exists, and confluent if any divergence in rewriting is eventually joined. A normal form is a term which cannot be rewritten any further. Termination ensures the existence of normal forms, while confluence ensures their uniqueness. We translate circuit designs implemented in Verilog RTL [24] to Term Rewriting Systems, and have applied this framework to arithmetic circuits. Currently, we work in the domain of combinational circuits and do not consider sequential circuits. Therefore, we do not translate Verilog non-blocking assignments, which can model non-deterministic semantics. In this domain of circuits, Verilog can be considered as having the deterministic semantics of an imperative programming language. Some steps have been taken in this direction for C-like languages [16], and others toward formalizing the semantics of Verilog [1], [29], [32].
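A minimal Python sketch can make normal forms, termination and confluence concrete. It assumes an encoding of our own choosing, not Verifire's: a term is a string (a constant) or a tuple `(symbol, arg1, ...)`, strings starting with `?` are rule variables, and a rule is an `(lhs, rhs)` pair. The Peano-addition rules in the usage example are a textbook terminating and confluent TRS, used here only for illustration.

```python
from typing import Optional, Union

# Encoding (our assumption): constants are strings, "?x" strings are rule
# variables, and compound terms are tuples (function_symbol, arg1, arg2, ...).
Term = Union[str, tuple]


def match(pattern: Term, term: Term, subst: dict) -> Optional[dict]:
    """Syntactic matching: extend subst so that pattern instantiates to term."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in subst:
            return subst if subst[pattern] == term else None
        return {**subst, pattern: term}
    if isinstance(pattern, tuple) and isinstance(term, tuple):
        if len(pattern) != len(term) or pattern[0] != term[0]:
            return None
        for p, t in zip(pattern[1:], term[1:]):
            subst = match(p, t, subst)
            if subst is None:
                return None
        return subst
    return subst if pattern == term else None


def substitute(term: Term, subst: dict) -> Term:
    if isinstance(term, str):
        return subst.get(term, term)
    return (term[0],) + tuple(substitute(a, subst) for a in term[1:])


def rewrite_once(term: Term, rules) -> Optional[Term]:
    """Rewrite at the root if possible, else in the leftmost reducible subterm."""
    for lhs, rhs in rules:
        s = match(lhs, term, {})
        if s is not None:
            return substitute(rhs, s)
    if isinstance(term, tuple):
        for i, arg in enumerate(term[1:], start=1):
            new = rewrite_once(arg, rules)
            if new is not None:
                return term[:i] + (new,) + term[i + 1:]
    return None  # no rule applies anywhere: term is a normal form


def normal_form(term: Term, rules, max_steps: int = 1000) -> Term:
    for _ in range(max_steps):
        nxt = rewrite_once(term, rules)
        if nxt is None:
            return term
        term = nxt
    raise RuntimeError("no normal form within the step bound (non-terminating?)")


RULES = [
    (("add", "0", "?y"), "?y"),
    (("add", ("s", "?x"), "?y"), ("s", ("add", "?x", "?y"))),
]
print(normal_form(("add", ("s", ("s", "0")), ("s", "0")), RULES))  # ('s', ('s', ('s', '0')))
```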
We follow a set of rules for the translation of Verilog into Term Rewriting Systems. Every rewrite rule is a structure-preserving program transformation in the Verilog design. (Here we use the terms design and program interchangeably, since a design written in Verilog is viewed as a program.) Collectively, the set of all such rules can be viewed as the Term Rewriting System for the Verilog design. The left-hand side of a rule is matched if the variables or the actual parameters are the same. If matched, these variables/function symbols are rewritten according to the corresponding update for that variable in the Verilog program. Therefore, all assignments and module calls in Verilog are modeled as rewrite rules. The variables of the Term Rewriting System are all the variables declared in the Verilog design (i.e. inputs, outputs, wires, registers). Every module in Verilog is modeled by a function symbol. The parameters of the function are all the variables of the module as well as other module instantiations, which are themselves denoted by function symbols. So a module is a term, with subterms as variables and other module (function) calls. A term is rewritten until no more rewriting is applicable to it; it is then said to be a normal form. Since our TRSs model Verilog designs, we allow variables to have a specific bit width and to be bitwise addressable. Our rewriting is directed toward obtaining the symbolic value of all the outputs of the Verilog design, so we define the normal form of the TRS with respect to these output variables. If all the output bits (specified in the bit width) have been rewritten into (i.e. they appear on the right-hand side of a rewrite rule), the term corresponding to the output variables is said to have reached a normal form. Therefore, our rewriting strategy is oriented
toward obtaining an expression (or symbolic value) for all the bits of all the output variables. In the case of arithmetic circuit designs, explicitly directing the rewriting is usually not necessary, because these circuits do not have multiple paths leading to the outputs. In well-behaved systems, the Verilog programs are deterministic and produce a single value for each of the outputs, so the corresponding TRS has a unique normal form. The rewriting for such systems is terminating. The intuition for the proof is that the lexicographic path ordering for the TRS is given by the decreasing number of unknown bits of a variable in every successive rewrite. The TRS terminates since, for every rule $l \rightarrow r$ and every substitution $\sigma$, the condition $l\sigma \succ r\sigma$ holds.
The termination function itself is a monotonic homomorphism [10]. Although the rewrites are directed toward obtaining values for the output variables, if there is more than one output variable as a subterm of a term, there may be more than one rule applicable to that term. In other words, two output variables may be rewritten into concurrently, thereby forming a critical pair. The well-behaved arithmetic circuits we consider are designed to produce a unique result as an output. Therefore, the critical pairs derive the same term (they are joinable), as mentioned in the termination proof argument. This term may be the final output (the unique normal form) or an intermediate point of join. Since all the critical pairs of the rewrite system can be shown to be joinable, the TRS is locally confluent, and, because it is also terminating, it is globally confluent by Newman's lemma. Being terminating and confluent, the TRSs translated from Verilog HDL are convergent.
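The termination and confluence argument can be checked mechanically on a small example. The sketch below is an assumption-laden model, not Verifire's translation: a hypothetical 2-bit ripple-carry adder is written as one rule per assigned bit, the operators `xor3` and `maj` stay uninterpreted, and a rule fires only when its target bit is still unknown and all its source bits are known. It asserts that every rewrite strictly decreases the number of unknown bits (the termination measure) and that all 24 firing orders reach the same normal form, as expected for a race-free system.

```python
import itertools

# Hypothetical 2-bit ripple-carry adder, one rule per assigned bit:
# (target_bit, source_bits, uninterpreted_operator).
RULES = [
    ("s0", ("a0", "b0", "cin"), "xor3"),
    ("c0", ("a0", "b0", "cin"), "maj"),
    ("s1", ("a1", "b1", "c0"), "xor3"),
    ("c1", ("a1", "b1", "c0"), "maj"),
]
INPUTS = ("a0", "b0", "a1", "b1", "cin")


def initial_state():
    state = {name: name for name in INPUTS}                 # inputs are known symbols
    state.update({target: None for target, _, _ in RULES})  # None = unknown bit
    return state


def enabled(state):
    """Rules whose target is unknown and whose sources are all known."""
    return [r for r in RULES
            if state[r[0]] is None and all(state[s] is not None for s in r[1])]


def unknown_bits(state):
    return sum(value is None for value in state.values())


def normal_form(priority):
    """Rewrite until no rule applies, always firing the enabled rule that
    appears first in `priority`."""
    state = initial_state()
    while True:
        candidates = enabled(state)
        if not candidates:
            return state
        target, sources, op = min(candidates, key=priority.index)
        before = unknown_bits(state)
        state = {**state, target: (op,) + tuple(state[s] for s in sources)}
        assert unknown_bits(state) < before  # termination measure decreases


forms = {tuple(sorted(normal_form(list(p)).items()))
         for p in itertools.permutations(RULES)}
print(len(forms))  # 1: every firing order reaches the same normal form
```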
3.0.1 Example Term Rewriting System
We illustrate an example that codifies a Verilog program into a TRS.
module addmux(inA, inB, opt, sel, out);
  input inA, inB, opt, sel;
  output out;
  reg out;
  wire addout;
  adder add1 (addout, inA, inB);
  always @(*)
    if (sel) out = addout;
    else out = opt;
endmodule
The terms in the TRS for the addmux module are:
addmux (inA, inB, opt, sel, out, addout, add1(addout, inA, inB))
add1 (S, A, B)
The rules in the TRS are:
addmux (inA, inB, opt, sel==1, out, addout, add1) ---> addmux (inA, inB, opt, sel, addout, addout, add1)
addmux (inA, inB, opt, sel==0, out, addout, add1) ---> addmux (inA, inB, opt, sel, opt, addout, add1)
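The two addmux rules can be mimicked in a few lines of Python. This is only a sketch under our own assumptions: the term is modeled as a `NamedTuple` whose fields follow the argument order of the addmux term (the add1 subterm is elided), `None` marks an `out` that has not yet been rewritten into, and `sel` is given a concrete value so that exactly one rule's condition holds.

```python
from typing import NamedTuple, Optional


class AddMux(NamedTuple):
    """Hypothetical model of the addmux term (add1 subterm elided)."""
    inA: str
    inB: str
    opt: str
    sel: int
    out: Optional[str]   # None until the output has been rewritten into
    addout: str


def rule_sel1(t: AddMux) -> Optional[AddMux]:
    # addmux(..., sel==1, out, addout, ...) ---> addmux(..., sel, addout, addout, ...)
    return t._replace(out=t.addout) if t.sel == 1 and t.out is None else None


def rule_sel0(t: AddMux) -> Optional[AddMux]:
    # addmux(..., sel==0, out, addout, ...) ---> addmux(..., sel, opt, addout, ...)
    return t._replace(out=t.opt) if t.sel == 0 and t.out is None else None


def rewrite(t: AddMux) -> AddMux:
    """Apply the first applicable rule until none applies (normal form)."""
    while True:
        for rule in (rule_sel1, rule_sel0):
            nxt = rule(t)
            if nxt is not None:
                t = nxt
                break
        else:
            return t


term = AddMux("inA", "inB", "opt", sel=1, out=None, addout="add1(inA, inB)")
print(rewrite(term).out)  # add1(inA, inB)
```

Guarding each rule on `out is None` is our way of encoding the normal-form condition from the text: once the output has been rewritten into, no rule applies.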
Each subterm obtains its actual parameters from the calling term by a rule. To faithfully model the Verilog semantics, two rules are used to model every subterm (module instantiation). These rules, which associate a module instance with the set of variables that form its actual parameters, are obtained by a topological sorting of the Verilog code. Since every module instance (subterm) communicates only with the instantiating module (calling term), the updates to the variables that have been passed to the subterm have to be reflected in the variables of the calling term. Therefore, every module instantiation is also associated with an updating rule. We assume that the input Verilog is race-free (i.e. no multiple parallel assignments to the same signal) and loop-free (i.e. no cyclic dependencies between combinational always blocks). The resulting structural TRS is then convergent, i.e. confluent (due to the race-free assumption) and terminating (due to the loop-free assumption). Note that for this structural TRS, the Verilog RTL operators are uninterpreted; the structural TRS is only used to construct terms defining the values of signals in terms of other signals.
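The construction of the structural TRS can be sketched with Python's standard `graphlib` module. The assignment-list format and the signal names are hypothetical; operators are kept as uninterpreted symbols, a duplicate target is rejected as a race, and `graphlib.CycleError` signals a combinational loop, mirroring the race-free and loop-free assumptions above.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical continuous assignments: (target, operator, operand signals).
ASSIGNS = [
    ("cout", "|", ["g", "t"]),
    ("s", "^", ["p", "cin"]),
    ("t", "&", ["p", "cin"]),
    ("p", "^", ["a", "b"]),
    ("g", "&", ["a", "b"]),
]


def build_terms(assigns, inputs):
    """Map every signal to a term over the primary inputs."""
    targets = [target for target, _, _ in assigns]
    if len(targets) != len(set(targets)):
        raise ValueError("race: a signal is assigned more than once")
    by_target = {target: (op, deps) for target, op, deps in assigns}
    graph = {target: deps for target, (_, deps) in by_target.items()}
    terms = {name: name for name in inputs}
    # static_order() lists every signal after the signals it depends on and
    # raises CycleError if the dependencies contain a loop.
    for sig in TopologicalSorter(graph).static_order():
        if sig in terms:
            continue  # primary input
        op, deps = by_target[sig]
        terms[sig] = (op,) + tuple(terms[d] for d in deps)
    return terms


terms = build_terms(ASSIGNS, ["a", "b", "cin"])
print(terms["cout"])  # ('|', ('&', 'a', 'b'), ('&', ('^', 'a', 'b'), 'cin'))
```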
Our goal is to prove the equivalence of an implementation design and a specification design. We assume these designs are (or can be translated into) combinational Verilog RTL modules which define a mapping from their inputs to their outputs. The equivalence of these mappings is the target of the analysis, a target shared with the combinational equivalence checking tools that are now ubiquitous. The monolithic verification problem is intractable in general, so (similar to equivalence checkers) we use signal names in the two modules as guidance in decomposing this equivalence proof.
we define the to be the set of signal functions and variables in as and for
of
, we have
Given
We use the word signal to refer to Verilog variables in RTL modules, and reserve variable for variables in TRSs.
and
as
name. In cases of multiple assignments to the same signal in either module, we may need to adjust the names assigned to the multiply-assigned signals in order to ensure correspondence (we will present our approach to dealing with this problem later in this section). We wish to prove
Proving this monolithically is prohibitively expensive in general, so we compute a set of comparison-point signal functions and make use of the following property to transfer the
substitution for
, one can prove by induction following the iterative expansions and using
to relieve the induction hypothesis and substituting equals
by proving
instead where
. The
proofs together:
Theorem 2.
and revised designs (including the required equivalent names for inputs and outputs). Comparison points for signals that have the same base name but are multiply-assigned in either design are determined by the following heuristic. We analyze the set of bits that will be assigned a non-constant value in each assignment to the base signal, and rename the assigned signals to match up the assignments which assign the same number of non-constant bits in the different designs. This is simply a heuristic, which appears to work well for arithmetic circuits. The user can bypass this heuristic by using a unique signal in every assignment and only introducing comparison points for signals with the same name. We now turn our attention to proving , we iterate through each signal
. In order to check
, and compute
, compute
rules which codify various identities about the RTL operators. For example, one may introduce idempotence and associativity rules for &, as well as rules for reducing arithmetic and left-shifts:
(x & x) ---> x
have multiple assignments
((x & y) & z) ---> (x & (y & z))
(x << 3) ---> (+ (x << 2) (x << 1) (x << 1))
(- (x << 1) x) ---> x
((x << 1) << 1) ---> (x << 2)
We denote this term simplification with the function
which maps
a term to a reduced term which is equal under all substitutions to . We then deduce
when
a later section, but we note that the procedure is not complete in determining Instead
be analyzed. The decomposition of the equivalence check using comparison points and incremental refinement lessens the requirements on the efficiency and sufficiency of the simplification function.
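The simplification step can be sketched as a bottom-up rewriter that applies the example identities until none fires. The sketch is ours, not Verifire's minimization database: terms are Python tuples such as `("&", x, y)` and `("<<", x, k)` with an integer shift amount, and only the rules listed above are implemented. Like the procedure described in the text, it is incomplete: two terms it fails to equate are not necessarily different.

```python
# Illustrative term encoding (assumed): ("&", x, y), ("<<", x, k) with int k,
# ("+", ...), ("-", x, y); leaves are signal-name strings.


def simplify_root(t):
    """Apply one identity at the root of t, or return None."""
    if not isinstance(t, tuple):
        return None
    op = t[0]
    if op == "&" and t[1] == t[2]:
        return t[1]                                   # (x & x) ---> x
    if op == "&" and isinstance(t[1], tuple) and t[1][0] == "&":
        return ("&", t[1][1], ("&", t[1][2], t[2]))   # ((x & y) & z) ---> (x & (y & z))
    if op == "<<" and t[2] == 3:
        x = t[1]                                      # (x << 3) ---> (+ (x << 2) (x << 1) (x << 1))
        return ("+", ("<<", x, 2), ("<<", x, 1), ("<<", x, 1))
    if op == "-" and t[1] == ("<<", t[2], 1):
        return t[2]                                   # (- (x << 1) x) ---> x
    if (op == "<<" and t[2] == 1 and isinstance(t[1], tuple)
            and t[1][0] == "<<" and t[1][2] == 1):
        return ("<<", t[1][1], 2)                     # ((x << 1) << 1) ---> (x << 2)
    return None


def simplify(t):
    """Simplify subterms first, then the root, until no identity applies."""
    if isinstance(t, tuple):
        t = (t[0],) + tuple(simplify(a) for a in t[1:])
    reduced = simplify_root(t)
    return t if reduced is None else simplify(reduced)


def equal_terms(t1, t2):
    """Sound but incomplete: True only if both terms reach the same reduced form."""
    return simplify(t1) == simplify(t2)


print(equal_terms(("&", ("&", "a", "b"), "c"), ("&", "a", ("&", "b", "c"))))  # True
print(simplify(("-", ("<<", "x", 1), "x")))  # x
```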
We now demonstrate how the procedure works for the following simple golden and revised designs:
module G(input in, output reg out);
  always @* begin
    out = in << 1;   // out1
    out = out << 1;  // out2
    out = out << 1;  // out3
    out = out << 1;  // out4
  end
endmodule
module R(input in, output reg out);
  always @* begin
    out = in << 2;   // out2
    out = out << 2;  // out4
  end
endmodule
Since out is multiply-assigned in both modules, our heuristic analyzes the set of non-constant bits in each assignment and deduces the comparison points defined by the names in the comments to the right of the assignments above. We then have the set of comparison points {out2, out4}. The comparison then proceeds by checking out2 and out4. For out4, G yields the term ((out2() << 1) << 1), since out3 is not a comparison point and is expanded, while R yields (out2() << 2). Applying the rule ((x << 1) << 1) ---> (x << 2), both terms simplify to (out2() << 2), and thus the two out4 signals are equal. The check for out2 proceeds in a similar fashion, with in() instead of out2().
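The G/R walk-through can be replayed end to end in a short Python sketch. Everything here is an illustrative assumption: each module is encoded as its list of `out = src << k` assignments, the signals are taken to be `W = 8` bits wide (the example modules leave the width implicit), the comparison-point heuristic counts the low bits each assignment forces to zero, and nested left shifts are folded with the generalization ((x << a) << b) ---> (x << (a + b)) of the rule used above.

```python
W = 8  # assumed signal width; the example modules leave it implicit

# Each assignment is (source, k), meaning  out = source << k.
G = [("in", 1), ("out", 1), ("out", 1), ("out", 1)]  # out1 .. out4
R = [("in", 2), ("out", 2)]                          # out2, out4


def nonconstant_bits(assigns, width):
    """Number of bits each assignment sets to a non-constant value
    (assumes `in` itself has no constant bits)."""
    counts, zeros = [], 0
    for src, k in assigns:
        zeros = k if src == "in" else zeros + k   # low bits forced to 0
        counts.append(max(width - zeros, 0))
    return counts


def comparison_points(g, r, width):
    """Pair assignments of g and r that set the same number of non-constant bits."""
    g_counts = nonconstant_bits(g, width)
    return [(g_counts.index(c), j)
            for j, c in enumerate(nonconstant_bits(r, width)) if c in g_counts]


def term_between(assigns, start, end, base):
    """Term for assignment `end`, expanded back to `base`, the value at `start`."""
    term = base
    for _, k in assigns[start + 1:end + 1]:
        term = ("<<", term, k)
    return term


def fold_shifts(term):
    """((x << a) << b) ---> (x << (a + b)), applied bottom-up."""
    if isinstance(term, tuple) and term[0] == "<<":
        inner = fold_shifts(term[1])
        if isinstance(inner, tuple) and inner[0] == "<<":
            return ("<<", inner[1], inner[2] + term[2])
        return ("<<", inner, term[2])
    return term


def equivalent(g, r, width=W):
    points = comparison_points(g, r, width)
    # Both final assignments (the module outputs) must be comparison points.
    if not points or points[-1] != (len(g) - 1, len(r) - 1):
        return False
    g_prev, r_prev, base = -1, -1, "in()"
    for gi, ri in points:
        lhs = fold_shifts(term_between(g, g_prev, gi, base))
        rhs = fold_shifts(term_between(r, r_prev, ri, base))
        if lhs != rhs:
            return False
        g_prev, r_prev, base = gi, ri, f"out{gi + 1}()"  # named after G's comments
    return True


print(comparison_points(G, R, W))  # [(1, 0), (3, 1)], i.e. out2 and out4
print(equivalent(G, R))            # True
```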
represent the set of all observable terms in both TRSs. Assuming a mapping is provided between the terms of
observable term
form an observable pair. The observable terms are typically the outputs in a Verilog design. Since the variables need to be bit addressable, we define a function
that has a value. We compare the observable terms of the two TRSs at specific
rewrites in the respective TRSs, such that for every observable pair defined in and ,
variables) in the two systems are compared when the same number of bits of an output variable have been updated in both systems. The symbolic expressions obtained by rewriting these bits of the variable in both systems are compared against each other. Theorem 3. Let each terms. Let
bits wide. Let and denote the initial (bottom) value of the and . Then, every comparison point is after
are
and will
have rewritten to the same number of bits. Let this value of bits be . We are given that at
,
successive comparison point, the next forms are reached. The normal form normal form is the term where point that compares
and the
comparison points, that all by bitwise equivalence of and , the two TRSs and are observationally equivalent.
of the same number of bits have been obtained, the rewrite step is considered a comparison point. The expressions for the two sets of observed variables are then checked for equivalence. Consider, for example, a Carry Lookahead Adder (CLA) design being verified against a golden Ripple Carry Adder (RCA) design. When the observation function is applied to these designs, its range is the output variables of the design, so the Sum and Carry variables are the observed variables. In the corresponding TRSs for these two systems, consider the number of rewriting steps each system takes to rewrite a (symbolic) value for a bit of the Sum and Carry variables. In the RCA TRS, a single rewriting step produces the values for one bit of Sum and Carry. In the CLA TRS, a single rewriting step produces the values for four bits of the Sum and Carry variables. So, a comparison point is identified after four rewrite steps in the RCA and one rewrite step in the CLA, when both systems have produced four bits. At every such comparison point, the symbolic expressions of the two TRSs are compared and checked for equivalence. Computing comparison points automatically is an important contribution of our technique; the computation is shown in greater detail in the proofs outlined in Chapter 6.
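The RCA/CLA comparison-point schedule can be computed from how many output bits each rewrite step produces. In this Python sketch the per-step bit counts are inputs we supply by hand (one bit per step for the RCA, four for the CLA, as described above); in Verifire they would come from the rewriting itself, so treat the function as an illustration of the idea rather than the tool's algorithm.

```python
from itertools import accumulate


def comparison_points(bits_a, bits_b):
    """bits_x[i] is the number of output bits produced by rewrite step i + 1 of
    system x (assumed >= 1, supplied by hand here). Returns (steps_a, steps_b,
    bits) for every bit count that both systems reach exactly."""
    done_a = list(accumulate(bits_a))   # bits produced after each step
    done_b = list(accumulate(bits_b))
    common = sorted(set(done_a) & set(done_b))
    return [(done_a.index(n) + 1, done_b.index(n) + 1, n) for n in common]


rca = [1] * 16   # 16-bit RCA: one Sum/Carry bit per rewrite step
cla = [4] * 4    # 16-bit CLA: four Sum/Carry bits per rewrite step
for steps_rca, steps_cla, bits in comparison_points(rca, cla):
    print(f"after {steps_rca} RCA steps and {steps_cla} CLA steps: {bits} bits")
```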
Verifire is a fully automated tool which implements the generic proof technique described in Chapter 4. There are two distinct parts to the tool, viz., a Verilog to Rewriting Systems translator (Vtrans) and a proof engine (Vprover). Vtrans is a compiler which accepts synthesizable Verilog as input and translates it to a Term Rewriting System. The translator automatically identifies the module hierarchy and constructs a TRS for the entire design. Vprover automatically generates proofs by using the notion of TRS equivalence between two Term Rewriting Systems. The reference TRS and the revised TRS are inputs to the proof engine, along with the observation function of each TRS. The mapping between the terms in the two TRSs is also predefined. With these inputs, the tool automatically generates a proof, or returns an error trace if it cannot establish the proof. Vprover is an iterative engine that, in every iteration, checks whether the condition of each rule is true. If a condition is satisfied, the corresponding term is rewritten; if the conditions of more than one rule are satisfied, the rewrites occur concurrently. Vprover computes the intermediate comparison points automatically. Using a set of directives, it generates proofs for the observed terms at every comparison point. The symbolic values of the observed terms are compared by comparing the expressions that have been generated by the rewrites/substitutions. In order to establish expression equivalence, the tool maintains a database of statically pre-verified expression minimizations. This set of minimizations is not complete: additions need to be made whenever the rewriter cannot minimize an expression because the database of minimization heuristics is insufficient. Verifire was implemented in C++ and was used to prove many multiplier circuits correct. The tool can automatically generate proofs for standard multiplier designs like the Booth multiplier, Array multipliers and Tree multipliers. It can also automatically generate proofs for multiplier designs that are modifications of these standard designs.
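The iteration scheme described for Vprover (evaluate every rule's condition, then fire all satisfied rules together) can be sketched in Python as follows. The rule format and the `xor3`/`maj` operators are hypothetical stand-ins, not Verifire's internals. Conditions are evaluated on a snapshot of the state taken at the start of each iteration, so concurrently firing rules do not see each other's updates, and the generated adder rules mirror Rules 33 and 34 of Chapter 6. The trace shows that a 4-bit RCA needs one iteration per Sum/Carry bit.

```python
def run_concurrently(rules, state):
    """rules: hypothetical (target, sources, op) triples; state: signal -> term
    (absent or None = unknown). Each iteration fires every rule whose target is
    unknown and whose sources are known, judged on a snapshot; stops at a fixpoint."""
    trace = []
    while True:
        snapshot = dict(state)
        fired = [(t, srcs, op) for t, srcs, op in rules
                 if snapshot.get(t) is None
                 and all(snapshot.get(s) is not None for s in srcs)]
        if not fired:
            return state, trace
        for t, srcs, op in fired:
            state[t] = (op,) + tuple(snapshot[s] for s in srcs)
        trace.append(sorted(t for t, _, _ in fired))


def rca_rules(n):
    """Per-bit sum and carry rules of an n-bit ripple-carry adder."""
    rules = []
    for i in range(n):
        carry_in = "cin" if i == 0 else f"c{i - 1}"
        rules.append((f"s{i}", (f"a{i}", f"b{i}", carry_in), "xor3"))  # cf. Rule 33
        rules.append((f"c{i}", (f"a{i}", f"b{i}", carry_in), "maj"))   # cf. Rule 34
    return rules


inputs = {f"{x}{i}": f"{x}{i}" for x in "ab" for i in range(4)}
inputs["cin"] = "cin"
state, trace = run_concurrently(rca_rules(4), inputs)
print(trace)  # [['c0', 's0'], ['c1', 's1'], ['c2', 's2'], ['c3', 's3']]
```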
In this chapter, we show the application of our technique and the efficiency of our tool in the domain of (integer) arithmetic circuits. Arithmetic circuits can be classified broadly into adders, shifters, comparators and multipliers. We illustrate the technique with respect to adders, and show how it works for modified shifters and comparators. We present all our experimental results with regard to multipliers. Due to space constraints, we do not present the generated proofs for all the circuits mentioned, but present the adder proof as a representative example.
6.1 Adders
We illustrate our verification technique by verifying the functionality of a 16-bit Carry Lookahead Adder. We use a simple ripple carry adder as the golden design for adders. It adds two vectors by doing a bitwise xor and generates a corresponding carry. The Verilog code for a 16-bit ripple carry adder design is shown below.
module rca16bit(A, B, Cin, S, Cout);
  input [15:0] A, B;
  input Cin;
  output [15:0] S;
  output Cout;            // S and Cout are driven by module instances, so they stay wires
  wire [15:0] Carry;
  rca1bit rca1bit0(A[0], B[0], Cin, S[0], Carry[0]);           // R1, R2
  rca1bit rca1bit1(A[1], B[1], Carry[0], S[1], Carry[1]);      // R3, R4
  . . .
  rca1bit rca1bit15(A[15], B[15], Carry[14], S[15], Cout);     // R31, R32
endmodule

module rca1bit(a, b, cin, s, cout);
  input a, b, cin;
  output s, cout;
  assign s = a ^ b ^ cin;               // R33
  assign cout = a&b | b&cin | cin&a;    // R34
endmodule
Each instance of the rca1bit() function is a subterm, with its own set of actual parameters. Every such subterm obtains its actual parameters from the calling term by a rule. These rules, which associate a module instance with the set of variables that form its actual parameters, are obtained by a topological sorting of the Verilog code. Since every module instance (subterm) communicates only with the instantiating module (calling term), the updates to the variables that have been passed to the subterm have to be reflected in the variables of the calling term. Therefore, every module instantiation is also associated with an updating rule. Since there are 16 module instantiations in the RCA Verilog code, there are 32 associated rules. Rules 33 and 34 correspond to the transitions that are defined for the rca1bit() term; since they are defined for the term itself, they can rewrite any particular instance of it.
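Rule 33: $\mathrm{rca1bit}(a, b, c, s, cout) \rightarrow \mathrm{rca1bit}(a, b, c, a \oplus b \oplus c, cout)$
Rule 34: $\mathrm{rca1bit}(a, b, c, s, cout) \rightarrow \mathrm{rca1bit}(a, b, c, s, (a \wedge b) \vee (b \wedge c) \vee (c \wedge a))$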
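The following is a minimal sketch, not Vtrans output, of how Rules 33 and 34 act on an rca1bit term. It shows how the instantiation and update rules thread the carry through the 16 instances. The `Rca1bit` tuple and the function names are hypothetical. Concrete bits stand in for the symbolic terms that the tool actually rewrites.

```python
from typing import NamedTuple, Optional


class Rca1bit(NamedTuple):
    """The rca1bit term: actual parameters a, b, c and outputs s, cout."""
    a: int
    b: int
    c: int
    s: Optional[int] = None     # None until Rule 33 fires
    cout: Optional[int] = None  # None until Rule 34 fires


def rule33(t: Rca1bit) -> Rca1bit:
    # rca1bit(a, b, c, s, cout) -> rca1bit(a, b, c, a ^ b ^ c, cout)
    return t._replace(s=t.a ^ t.b ^ t.c)


def rule34(t: Rca1bit) -> Rca1bit:
    # rca1bit(a, b, c, s, cout) -> rca1bit(a, b, c, s, a&b | b&c | c&a)
    return t._replace(cout=(t.a & t.b) | (t.b & t.c) | (t.c & t.a))


def rca16bit(A: int, B: int, cin: int, width: int = 16) -> tuple:
    """Rewrite the RCA term one instance at a time and return (S, Cout)."""
    S, carry = 0, cin
    for i in range(width):
        # instantiation rule: pass A[i], B[i] and the incoming carry to the subterm
        sub = Rca1bit((A >> i) & 1, (B >> i) & 1, carry)
        sub = rule34(rule33(sub))
        # update rule: reflect the subterm's outputs in the calling term
        S |= sub.s << i
        carry = sub.cout
    return S, carry


print(rca16bit(0xFFFF, 0x0001, 0))   # (0, 1)
```

Each loop iteration corresponds to one pair of RCA rules (R1/R2, R3/R4, ...) followed by Rules 33 and 34 on the new subterm.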
The above translation is done automatically by Vtrans, the translator in our tool. The target design, a Carry Lookahead Adder (CLA), is similarly translated from its Verilog implementation to a TRS. The Verilog code for the CLA is shown below.
module cla16bit (A, B, Cin, S, Cout);
  input  [15:0] A, B;
  input         Cin;
  output [15:0] S;
  output        Cout;
  wire          C3, C7, C11;

  fastcarry fc   (A, B, Cin, C3, C7, C11);
  cla4bit   cla0 (A[3:0],   B[3:0],   Cin, S[3:0]);
  cla4bit   cla1 (A[7:4],   B[7:4],   C3,  S[7:4]);
  cla4bit   cla2 (A[11:8],  B[11:8],  C7,  S[11:8]);
  cla4bit   cla3 (A[15:12], B[15:12], C11, S[15:12]);   // R9, R10
endmodule

module cla4bit (a, b, cin, s);
  input  [3:0] a, b;
  input        cin;
  output [3:0] s;
  wire   [3:0] c;      // c[i]: carry into bit i
  wire   [3:0] p, g;   // propagate and generate signals

  assign c[0] = cin;                                                      // R11
  assign c[1] = g[0] | p[0]&cin;                                          // R12
  assign c[2] = g[1] | p[1]&g[0] | p[1]&p[0]&cin;                         // R13
  assign c[3] = g[2] | p[2]&g[1] | p[2]&p[1]&g[0] | p[2]&p[1]&p[0]&cin;   // R14

  assign s[0] = a[0] ^ b[0] ^ c[0];   // R15
  assign s[1] = a[1] ^ b[1] ^ c[1];   // R16
  assign s[2] = a[2] ^ b[2] ^ c[2];   // R17
  assign s[3] = a[3] ^ b[3] ^ c[3];   // R18

  // R19 to R26: instantiation and update rules for the four PGgen calls
  PGgen pg0 (a[0], b[0], p[0], g[0]);
  PGgen pg1 (a[1], b[1], p[1], g[1]);
  PGgen pg2 (a[2], b[2], p[2], g[2]);
  PGgen pg3 (a[3], b[3], p[3], g[3]);
endmodule

module PGgen (a, b, p, g);
  input  a, b;
  output p, g;
  assign p = a ^ b;   // R27
  assign g = a & b;   // R28
endmodule
The main module instantiates the cla4bit module four times, so the design has four cla4bit blocks. Each cla4bit block computes four successive carries at a time, together with the sum for the corresponding four bits. The cla4bit module also calls the PGgen module, which generates the propagate (P) and generate (G) signals for the block. A module called fastcarry computes the carry-in values C3, C7 and C11 of the upper three cla4bit blocks; the first block uses Cin directly. The terms in the CLA TRS are:
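The following is a minimal sketch of one cla4bit block, assuming that c[i] denotes the carry into bit i. Each carry is expanded directly from the propagate and generate signals of the lower bits, as in Rules 11 to 14, rather than rippled. The fastcarry module is not shown in the text. In the hypothetical `cla16bit` below, block carries are therefore simply chained, which is a functional stand-in and not a lookahead implementation.

```python
def pg_gen(a: int, b: int) -> tuple:
    """PGgen: propagate p = a ^ b and generate g = a & b for one bit."""
    return a ^ b, a & b


def cla4bit(a: int, b: int, cin: int) -> tuple:
    """4-bit lookahead block; returns (s, carry out of the block)."""
    p, g = zip(*(pg_gen((a >> i) & 1, (b >> i) & 1) for i in range(4)))
    # c[i] is the carry INTO bit i, written directly in terms of p, g and cin
    c0 = cin
    c1 = g[0] | (p[0] & cin)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & cin)
    # block carry out: the value fastcarry would supply to the next block
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & cin))
    s = 0
    for i, ci in enumerate((c0, c1, c2, c3)):
        s |= (p[i] ^ ci) << i          # s[i] = a[i] ^ b[i] ^ c[i]
    return s, c4


def cla16bit(A: int, B: int, cin: int) -> tuple:
    """Four cla4bit blocks; block carries are chained here for brevity."""
    S, carry = 0, cin
    for k in range(4):
        s, carry = cla4bit((A >> 4 * k) & 0xF, (B >> 4 * k) & 0xF, carry)
        S |= s << 4 * k
    return S, carry


print(cla4bit(0xF, 0x1, 0))   # (0, 1)
```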
cla16bit(A, B, Cin, S, Cout,
         cla4bit0(A[3:0], B[3:0], Cin, S[3:0]),
         cla4bit1(A[7:4], B[7:4], C3, S[7:4]),
         cla4bit2(A[11:8], B[11:8], C7, S[11:8]),
         cla4bit3(A[15:12], B[15:12], C11, S[15:12]),
         fc())
cla4bit(a, b, cin, s, c,
        PGgen0(a[0], b[0], p[0], g[0]),
        PGgen1(a[1], b[1], p[1], g[1]),
        PGgen2(a[2], b[2], p[2], g[2]),
        PGgen3(a[3], b[3], p[3], g[3]))
PGgen(a, b, P, G)
fc(A, B, C3, C7, C11)
For clarity, we have shown the rules that Vtrans generates from the Verilog as labels in the Verilog code. As explained in the case of the RCA, every module call in Verilog generates two rules: one for instantiation and the other for updating. Rules 1 to 10 are these rules for the module calls in the cla16bit module. Rules 11 to 18 are defined for the cla4bit module. They compute the values of c[3:0] and use them to calculate s[3:0] on the right-hand side. Rules 19 to 26 pertain to the module calls of the PGgen() module. We have not shown the rules for the fast carry block, fc, since its carries are calculated using the same type of rules as in the cla4bit block.

We define an observation function for both TRSs, whose range is the variables S and Cout. The comparison points for the two designs are computed as the transitions whose right-hand side is a bit of the observed variables, S and Cout. (We assume a mapping between the two TRSs that gives the name correspondence of the variables of interest.) Note that in the ripple carry adder TRS, every rewriting step (including the rules that instantiate and update the variables) updates only one bit of the sum, S. For instance, rules 1 and 2 in the RCA form a single rewrite step that updates S[0]. In the CLA, however, every rewriting step (Rules 1 to 10 in the TRS for the CLA) updates four bits of the sum, S. Therefore, a single step in the CLA corresponds to four steps in the ripple carry adder, and a comparison can take place only at the point where four bits of the ripple carry adder sum have been obtained. Vprover, the expression equivalence checker in our tool, uses these directives to compute the comparison points automatically. At the first comparison point, after S[3] is obtained in the two TRSs, the expressions contained in S[3:0] are compared. Since the rewriting in both systems is directed toward obtaining the observed variables, the trace of the rewriting steps that rewrite these variables is maintained. Vprover uses a set of minimization heuristics to simplify expressions and check their equivalence. Tracing the rewrite steps in both TRSs that lead to the observed variable shows intuitively why the generated expressions are equivalent. A correspondence between the rules of the two TRSs is given below, and Figure 6.1 illustrates it intuitively.

[Figure 6.1: Proof of correctness of the Carry Lookahead Adder compared against the Ripple Carry Adder. The RCA TRS updates S[0], S[1], S[2] and S[3] one bit at a time (rules R1,R2 through R7,R8), while the CLA TRS updates S[3:0], S[7:4], S[11:8] and S[15:12] four bits at a time. At each comparison point, the observable terms of the two systems are checked for expression equivalence. Rewriting and expression equivalence form the two engines of Vprover.]

Rule 33 in the RCA TRS corresponds to Rules 15 to 18 in the CLA TRS, since in both designs each sum bit is computed as an xor of the corresponding operand bits and the incoming carry bit. However, the carry terms in the two TRSs are different. The expression for Carry[2] in the RCA is obtained by applying Rules 1 to 6. The corresponding value in the CLA TRS, c[3], is computed by Rules 11 to 14. Expanding the rewrites of Rules 19 to 26 gives this fourth-stage carry in terms of the Ps and Gs of the previous stages. Rules 27 and 28 then give the same expression in terms of the bits of A and B instead of Ps and Gs. After simplification, the carry expressions in the two TRSs are therefore equal. Once S[3] is verified, the same procedure is used to verify S[7], S[11] and S[15]. Both TRSs reach their normal form when S[15] is computed, and the two TRSs are thereby proved equivalent.
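The following is a minimal sketch of how comparison points follow from the update granularity of the two TRSs. The RCA trace produces one bit of S per rewrite step and the CLA trace produces four, so the RCA is advanced four steps per CLA step and the slices are compared there. The traces below are hypothetical stand-ins that yield concrete values. Vprover compares symbolic expressions instead.

```python
from typing import Iterator, List, Tuple


def rca_trace(A: int, B: int, cin: int) -> Iterator[Tuple[int, int]]:
    """RCA TRS: each rewrite step (instantiate + update) yields one bit of S."""
    carry = cin
    for i in range(16):
        a, b = (A >> i) & 1, (B >> i) & 1
        yield i, a ^ b ^ carry
        carry = (a & b) | (b & carry) | (carry & a)


def cla_trace(A: int, B: int, cin: int) -> Iterator[Tuple[int, int]]:
    """CLA TRS: each rewrite step yields a 4-bit slice of S (lowest index, value)."""
    carry = cin
    for k in range(4):
        total = ((A >> 4 * k) & 0xF) + ((B >> 4 * k) & 0xF) + carry
        yield 4 * k, total & 0xF
        carry = total >> 4


def compare_at_points(A: int, B: int, cin: int) -> List[int]:
    """Advance the RCA four steps per CLA step and compare the S slices.
    Returns the index of the last S bit checked at each comparison point."""
    rca = rca_trace(A, B, cin)
    points = []
    for low, nibble in cla_trace(A, B, cin):
        rca_nibble = 0
        for _ in range(4):
            i, bit = next(rca)
            rca_nibble |= bit << (i - low)
        if rca_nibble != nibble:
            raise AssertionError(f"S[{low + 3}:{low}] differs")
        points.append(low + 3)
    return points


print(compare_at_points(0x1234, 0xABCD, 1))   # [3, 7, 11, 15]
```

The four comparison points fall after S[3], S[7], S[11] and S[15], matching the proof outline above.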
We consider the space of multipliers divided into standard and non-standard multipliers. The standard multipliers are the widely used, common multiplier designs like Booth, Wallace tree, Dadda Tree and Array multipliers. The non-standard multipliers have incremental optimizations made to these standard multiplier designs. We have extended our technique to cover the space of these two categories of multipliers. We illustrate our technique on the Booth multiplier and BISMUL, an optimization of the Booth multiplier.
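[Figure: block diagram of the Booth multiplier. The multiplier (mplier) and the multiplicand (mcand) each pass through a shift3 unit to give the shifted multiplier and shifted multiplicand, followed by a select block and an adder (adder).]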
To prove the functional correctness of the above design, we follow the technique explained in Chapter 2 and use the outline of the proof given there. We use a simple Shift-and-Add multiplier as the reference TRS for multipliers. It performs multiplication by generating partial products, shifting the multiplicand left by one bit after every partial product calculation. The partial product of the current stage is set to the sum of the previous partial product and either the shifted multiplicand of the current stage or 0, depending on whether the multiplier bit for the current stage is 1 or 0. The Verilog code of the Shift-and-Add calls a shift module and an add module iteratively; the entire Verilog code for the Shift-and-Add multiplier is given in Appendix B. The target design here is the Booth multiplier discussed above, whose 3-bit scan of the multiplier selects one of eight partial products:

multiplier bits:            000  001  010  011  100  101  110  111
partial product generated:  pp0  pp1  pp2  pp3  pp4  pp5  pp6  pp7

Here ppk is k times the multiplicand.

Vtrans extracts the Booth TRS from the Verilog code. In the case of the Booth multiplier, the proof obligation (PO) is stated over the observed variable, product, as explained in Chapter 2. A sketch of this proof, as output by the tool, follows. The first comparison point in the proof is reached after three bits of the output (product) are updated in both TRSs. This is because the Booth multiplier updates three bits of its product simultaneously, whereas the Shift-and-Add updates its product one bit at a time. The output of the tool at the first comparison point is shown below; Stage i in the tool output represents the i-th update of product.
The rules in the tool output are reproduced in pseudo-Verilog syntax and correspond to the rewrite rules of the TRS as described in Chapter 2. For example, Reference.Stage 1.Rule x and Reference.Stage 1.Rule y together correspond to the TRS rewrite rule product() ---> product() + if (y[0](), mcand(), 0).

Comparison Point 1:

Reference Model (Shift-and-Add):
  Stage 1.  Rule x: product = product + mcand   if (y[0])
            Rule y: product = product + 0       if (~y[0])
  Stages 2 and 3: Rules x and y of the same form, conditioned on y[1] and y[2] respectively, with the multiplicand shifted left by one bit per stage.

Revised Model (Booth):
  Stage 1.  Rule a: product = product + 0                        if (~y[0] & ~y[1] & ~y[2])
            Rule b: product = product + mcand                    if ( y[0] & ~y[1] & ~y[2])
            Rule c: product = product + mcand<<1                 if (~y[0] &  y[1] & ~y[2])
            Rule d: product = product + mcand<<1 + mcand         if ( y[0] &  y[1] & ~y[2])
            Rule e: product = product + mcand<<2                 if (~y[0] & ~y[1] &  y[2])
            Rule f: product = product + mcand<<2 + mcand         if ( y[0] & ~y[1] &  y[2])
            Rule g: product = product + (mcand<<1 + mcand)<<1    if (~y[0] &  y[1] &  y[2])
            Rule h: product = product + mcand<<3 - mcand         if ( y[0] &  y[1] &  y[2])
The expressions generated by both TRSs at the first comparison point are displayed with their corresponding rules. For instance, Reference.Stage 1.Rule x is product = product + mcand if (y[0]). Correspondence at the comparison point is established by a case-by-case analysis of the rules: every encoding of the Booth multiplier is compared to the Shift-and-Add under the same conditions. For instance, Revised.Stage 1.Rule a gives the expression for the partial product generated for the Booth encoding 000. Applying the condition y[0]y[1]y[2] = 000 to the Shift-and-Add selects rules Reference.Rule 1y, Reference.Rule 2y and Reference.Rule 3y. Such an analysis is performed for all eight cases. Vprover then applies the reduce() function to simplify corresponding terms at Comparison Point 1. An outline of this simplification, as output by the tool, is given below.
Correspondence (after case analysis):
Revised        Reference
Rule 1a   ==   Rule 1y, Rule 2y, Rule 3y
Rule 1b   ==   Rule 1x, Rule 2y, Rule 3y
Rule 1c   ==   Rule 1y, Rule 2x, Rule 3y
Rule 1d   ==   Rule 1x, Rule 2x, Rule 3y
Rule 1e   ==   Rule 1y, Rule 2y, Rule 3x
Rule 1f   ==   Rule 1x, Rule 2y, Rule 3x
Rule 1g   ==   Rule 1y, Rule 2x, Rule 3x
Rule 1h   ==   Rule 1x, Rule 2x, Rule 3x
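The case analysis above can be sketched as follows. For each Booth rule a to h, the sketch derives the Shift-and-Add rules selected by the same condition on y[0], y[1], y[2] and checks that both add the same multiple of mcand. This is an illustration under stated assumptions, not Vprover. It checks values for a concrete multiplicand, whereas Vprover compares the expressions symbolically. The forms of Rules g and h are taken from Table 7.2 (P6, P7) and are assumptions.

```python
# Booth Stage 1 rules a..h as the multiple of mcand each one adds
# (forms of g and h follow Table 7.2's P6 and P7; assumed, not tool output).
BOOTH_RULES = {
    "a": lambda m: 0,
    "b": lambda m: m,
    "c": lambda m: m << 1,
    "d": lambda m: (m << 1) + m,
    "e": lambda m: m << 2,
    "f": lambda m: (m << 2) + m,
    "g": lambda m: ((m << 1) + m) << 1,
    "h": lambda m: (m << 3) - m,
}


def reference_rules(y_bits, m):
    """Shift-and-Add stages 1..3: Rule x adds mcand << i when y[i] = 1,
    Rule y adds 0 when y[i] = 0. Returns (rule names, total added)."""
    names, total = [], 0
    for i, bit in enumerate(y_bits):
        names.append(f"Rule {i + 1}{'x' if bit else 'y'}")
        total += (m << i) if bit else 0
    return names, total


def case_analysis(m):
    """Pair each Booth rule with the reference rules selected by the same
    condition on y[0], y[1], y[2], and check that the added values agree."""
    table = {}
    for code, rule in enumerate("abcdefgh"):
        y_bits = (code & 1, (code >> 1) & 1, (code >> 2) & 1)  # y[0], y[1], y[2]
        names, ref_value = reference_rules(y_bits, m)
        assert BOOTH_RULES[rule](m) == ref_value, f"Rule 1{rule} differs"
        table[f"Rule 1{rule}"] = names
    return table


for revised, reference in case_analysis(m=13).items():
    print(revised, "==", ", ".join(reference))
```

The printed pairs reproduce the correspondence listed above.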
The proof then proceeds iteratively to subsequent comparison points until the output is obtained in its normal form. Figure 7.2 illustrates the proof intuitively.
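The iteration over comparison points can be sketched as follows, assuming a non-overlapping 3-bit scan as in the Booth rules above. After k Booth steps, the Booth product equals the Shift-and-Add product after 3k stages, since both equal mcand times the low 3k bits of y. The final scan covers only the top remaining bit. The function names are hypothetical and concrete values stand in for symbolic terms.

```python
def shift_and_add_states(y: int, mcand: int, n: int) -> list:
    """product after each Shift-and-Add stage (one multiplier bit per stage)."""
    product, states = 0, []
    for i in range(n):
        if (y >> i) & 1:
            product += mcand << i
        states.append(product)
    return states


def booth_states(y: int, mcand: int, n: int) -> list:
    """product after each non-overlapping 3-bit scan of the multiplier."""
    product, states = 0, []
    for j in range(0, n, 3):
        group = (y >> j) & 0b111            # selects pp0..pp7, i.e. group * mcand
        product += (group * mcand) << j
        states.append(product)
    return states


def check_comparison_points(y: int, mcand: int, n: int) -> int:
    """Compare product at every comparison point; return how many were checked."""
    reference = shift_and_add_states(y, mcand, n)
    revised = booth_states(y, mcand, n)
    for k, value in enumerate(revised):
        stage = min(3 * k + 2, n - 1)       # last reference stage covered by point k + 1
        if value != reference[stage]:
            raise AssertionError(f"comparison point {k + 1} fails")
    return len(revised)


print(check_comparison_points(0xDEADBEEFCAFEF00D, 0x0123456789ABCDEF, 64))   # 22
```

For a 64-bit multiplier, this yields 22 comparison points, the same count as in Figure 7.2.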
[Figure 7.2: Proof of correctness of the Booth multiplier compared against the Shift&Add multiplier. Between comparison points, the Shift&Add TRS applies its per-stage rules (Rule x and Rule y) once for each of three product bits, while the Booth TRS applies one of its Rules a to h. The observable variable, product, is then checked for expression equivalence: P[2:0] at Compare Point 1, P[5:3] at Compare Point 2, P[8:6] at Compare Point 3, ..., P[63:61] at Compare Point 21, and P[64] at Compare Point 22. Rewriting and expression equivalence form the two engines of Vprover.]

7.2 BISMUL

In order to improve the performance of multipliers, more complicated algorithms and designs are used. We consider a high-performance multiplier, BISMUL [30], which is a modification of the Booth multiplier. A radix-3 Booth multiplier architecture that uses a 3-bit scan with no overlap invariably generates dummy bits in the last 3-bit scan. In BISMUL, this last 3-bit scan is moved to the first 3-bit scan, so that the dummy bits can be used for odd-multiple generation. The sequence for bit scanning is shown in Table 7.2. BISMUL obtains its speed improvement by executing several Partial Product Selectors (PPSELs) in parallel. The shifting sequence of the multiplier determines the inputs to the PPSELs, and the selected partial products are summed through carry-save additions.

[Figure: BISMUL block diagram, showing the multiplier register and multiplicand register (n bits each), four 16-bit paths, and the product shift register.]

The BISMUL architecture comprises Product Registers (PR), Partial Product Generators (PPG), Partial Product Selectors (PPS), multiplexers and a Carry Save Adder (CSA). PPG is implemented according to Table 7.2. PPS consists of four PPSELs, each with 16-bit inputs whose sequence is shown in Table 7.3. BISMUL operates as follows. In the first cycle, PPG generates eight partial products and each PPSEL selects one partial product. The two dummy bits in the lower bit positions of the first three bits of each PPSEL cause the selection of either 0 or four times the multiplicand (000 or 100). The selected partial products are added and stored in the PR. The eight partial products are generated only in the first cycle; each subsequent cycle repeats the selection and addition. The entire Verilog code for the BISMUL multiplier is given in Appendix D.

We prove BISMUL correct by performing a series of reductions that reduce it to its standard design, the Booth multiplier. The standard Booth multiplier has already been verified using the technique above. Proving BISMUL equivalent to it therefore proves the non-standard design correct.
P0: 0
P1: multiplicand
P2: shift multiplicand left by 1
P3: add P1 and P2
P4: shift multiplicand left by 2
P5: add P1 and P4
P6: shift P3 left by 1
P7: subtract P1 from 8 × multiplicand
Table 7.2: Generation of Partial Product Terms
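The shift-and-add recipe in Table 7.2 can be sketched as follows. It checks that Pk comes out as k times the multiplicand, which is what a 3-bit scan value k selects. The function names are hypothetical, and reading "8" in P7 as 8 × multiplicand is an assumption.

```python
def partial_products(mcand: int) -> list:
    """P0..P7 built with the shifts and additions listed in Table 7.2."""
    P = [0] * 8
    P[1] = mcand
    P[2] = mcand << 1
    P[3] = P[1] + P[2]
    P[4] = mcand << 2
    P[5] = P[1] + P[4]
    P[6] = P[3] << 1
    P[7] = (mcand << 3) - P[1]     # assumes "8" means 8 x multiplicand
    return P


def ppsel(P: list, scan: int) -> int:
    """Select the partial product addressed by a 3-bit multiplier scan."""
    return P[scan & 0b111]


print(partial_products(5))   # [0, 5, 10, 15, 20, 25, 30, 35]
```

Only odd multiples (P3, P5, P7) need an adder or subtractor; the even ones are shifts, which is why the dummy bits can be used for odd-multiple generation.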
PPSEL   Inputs
0       Multiplier[54:52], [42:40], [30:28], [18:16], [6:4]
1       Multiplier[57:55], [45:43], [33:31], [21:19], [9:7]
2       Multiplier[60:58], [48:46], [36:34], [24:22], [12:10]
3       Multiplier[63:61], [51:49], [39:37], [27:25], [15:13]
Table 7.3: The inputs of each PPSEL
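The bit groups in Table 7.3 follow a regular pattern. PPSEL k receives five 3-bit groups of the multiplier, starting at bit 4 + 3k and spaced 12 bits apart, listed from the highest group down. The following is a minimal sketch, with a hypothetical function name, that regenerates the table and checks that the four PPSELs together cover multiplier bits 4 to 63 exactly once. Bits [3:0] do not appear in Table 7.3.

```python
def ppsel_inputs(k: int) -> list:
    """(high, low) multiplier bit ranges fed to PPSEL k, in Table 7.3 order."""
    lows = [4 + 3 * k + 12 * j for j in range(5)]
    return [(low + 2, low) for low in reversed(lows)]


for k in range(4):
    print(k, ", ".join(f"[{hi}:{lo}]" for hi, lo in ppsel_inputs(k)))

covered = sorted(b for k in range(4) for hi, lo in ppsel_inputs(k)
                 for b in range(lo, hi + 1))
print(covered == list(range(4, 64)))   # True
```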
Verifire extracts the corresponding TRSs from the BISMUL and Booth Verilog code. The tool compares the modules in the non-standard design (which derive from modules in the standard design) to the corresponding modules in the standard design. The correspondence (and equivalence) between these derived modules and the modules of the standard design is established by the same method as described in Section 7.1 for the Booth and the Shift-and-Add multipliers. To prove the reduction of BISMUL to Booth, it is enough to prove that the changed terms (modules) in BISMUL are equivalent to the original Booth terms. In this case, the terms ppsel0, ppsel1, ppsel2, ppsel3 and mux8to1 of BISMUL form the revised design, and the terms ppsel and shift3 of the Booth multiplier act as the corresponding reference design. Similarly, the terms productshift and carrysaveadder of BISMUL correspond to the ppsel and adder terms of the Booth multiplier. Therefore, it is sufficient to prove the validity of each correspondence. We have verified BISMUL using our technique, and we have also verified the Wallace Tree multiplier, as described in Section 7.3. Along similar lines, Array and Dadda Tree multipliers can also be verified using our technique. For each of these, we can also verify some modifications to the standard designs. The terms in the TRS for the modified design are simplified to terms in the TRS for the standard design, using the database of rules in Vprover. This set of rules is not exhaustive and may require manual intervention when presented with an entirely new design that does not build on the standard ones. However, for a large space of designs, the process is completely automated.
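[Figure: the 4-bit Wallace Tree multiplier. The 4-bit multiplier (y) and multiplicand (mcand) feed a partial product generator. The 8-bit partial products pass through a carry-save adder tree of two 3:2 CSAs, and a full adder produces the 8-bit product.]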
We show a verification outline in this section. In our design of the Wallace Tree, the partial products are generated without Booth encoding, to demonstrate the efficacy of the technique on disparate designs. This also means that the partial product terms generated by the Wallace Tree TRS are larger and more complicated than those of the radix-3 Booth encoded multiplier discussed in Section 7.1. In the interest of readability, we demonstrate an illustrative version of the proof that verifies a 4-bit Wallace Tree multiplier. Its partial product generator generates four 8-bit partial products (one corresponding to each bit of the multiplier), which are added in a carry-save adder tree.
The Shift-and-Add multiplier is used as the golden design in this proof; its working is described in Section 7.1. Vtrans translates the golden and the target designs into their corresponding TRSs. For the Wallace Tree multiplier, as for the Booth multiplier, the proof obligation (PO) is stated over the observed variable, product. A sketch of this proof, as output by the tool, follows. In this proof, no intermediate comparison points are generated, because the Wallace Tree design that we have chosen does not compute bits of the product partially. So, in the case of the 64-bit multiplier, the Wallace Tree TRS terms are rewritten into a large, composite term. The comparison point then comes after 64 steps of rewriting in the Shift-and-Add TRS and a single, monolithic rewrite step in the Wallace Tree TRS. For the current illustration of the proof on a 4-bit multiplier, the comparison point comes after 4 rewriting steps in the Shift-and-Add TRS and one monolithic step in the Wallace Tree TRS.
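The single comparison point of this illustration can be sketched as an exhaustive check, assuming the structure in the block diagram above. There are four partial products, two 3:2 carry-save adders and a final adder. The monolithic Wallace Tree result is compared with the product after four Shift-and-Add steps. The function names are hypothetical, and this concrete enumeration stands in for Vprover's symbolic comparison.

```python
def csa(x: int, y: int, z: int) -> tuple:
    """3:2 carry-save adder on bit vectors: x + y + z == s + c."""
    s = x ^ y ^ z
    c = ((x & y) | (y & z) | (z & x)) << 1
    return s, c


def wallace4(y: int, mcand: int) -> int:
    # partial product generator: four 8-bit terms, pp[i] = mcand << i if y[i] else 0
    pp = [(mcand << i) if (y >> i) & 1 else 0 for i in range(4)]
    s1, c1 = csa(pp[0], pp[1], pp[2])   # first 3:2 CSA
    s2, c2 = csa(s1, c1, pp[3])         # second 3:2 CSA
    return (s2 + c2) & 0xFF             # final adder produces the 8-bit product


def shift_and_add4(y: int, mcand: int) -> int:
    product = 0
    for i in range(4):                  # one rewriting step per multiplier bit
        if (y >> i) & 1:
            product += mcand << i
    return product


# exhaustive check of the single comparison point for all 4-bit operands
print(all(wallace4(y, m) == shift_and_add4(y, m)
          for y in range(16) for m in range(16)))   # True
```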
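Comparison Point 1:

Reference Model (Shift-and-Add):
  Stage 1.  Rule x: product = product + mcand   if (y[0])
            Rule y: product = product + 0       if (~y[0])
  Stages 2 to 4: Rules x and y of the same form, conditioned on y[1], y[2] and y[3] respectively, with the multiplicand shifted left by one bit per stage.

Revised Model (Wallace Tree):
  Stage 1.  Rule a: product = 0 + 0 + 0 + 0   if (~y[0] & ~y[1] & ~y[2] & ~y[3])
  Rules b to p have the same form, a sum of four partial products in which the i-th term is the multiplicand shifted left by i bits when y[i] = 1 and 0 otherwise, under the conditions:
            Rule b: if ( y[0] & ~y[1] & ~y[2] & ~y[3])
            Rule c: if (~y[0] &  y[1] & ~y[2] & ~y[3])
            Rule d: if ( y[0] &  y[1] & ~y[2] & ~y[3])
            Rule e: if (~y[0] & ~y[1] &  y[2] & ~y[3])
            Rule f: if ( y[0] & ~y[1] &  y[2] & ~y[3])
            Rule g: if (~y[0] &  y[1] &  y[2] & ~y[3])
            Rule h: if ( y[0] &  y[1] &  y[2] & ~y[3])
            Rule i: if (~y[0] & ~y[1] & ~y[2] &  y[3])
            Rule j: if ( y[0] & ~y[1] & ~y[2] &  y[3])
            Rule k: if (~y[0] &  y[1] & ~y[2] &  y[3])
            Rule l: if ( y[0] &  y[1] & ~y[2] &  y[3])
            Rule m: if (~y[0] & ~y[1] &  y[2] &  y[3])
            Rule n: if ( y[0] & ~y[1] &  y[2] &  y[3])
            Rule o: if (~y[0] &  y[1] &  y[2] &  y[3])
            Rule p: if ( y[0] &  y[1] &  y[2] &  y[3])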
The expressions generated by both TRSs at the first comparison point are displayed with their corresponding rules. For instance, Reference.Stage 1.Rule x is product = product + mcand if (y[0]). Correspondence at the comparison point is established by a case-by-case analysis of the rules: every encoding of the Wallace Tree multiplier is compared to the Shift-and-Add under the same conditions. For instance, Revised.Stage 1.Rule a gives the expression for the partial product generated for the multiplier value 0000. Applying the condition y[0]y[1]y[2]y[3] = 0000 to the Shift-and-Add selects rules Reference.Rule 1y, Reference.Rule 2y, Reference.Rule 3y and Reference.Rule 4y. Such an analysis is performed for all sixteen cases. Vprover then applies the reduce() function to simplify corresponding terms at Comparison Point 1. An outline of this simplification, as output by the tool, is given below.
Correspondence (after case analysis):
Revised        Reference
Rule 1a   ==   Rule 1y, Rule 2y, Rule 3y, Rule 4y
Rule 1b   ==   Rule 1x, Rule 2y, Rule 3y, Rule 4y
Rule 1c   ==   Rule 1y, Rule 2x, Rule 3y, Rule 4y
Rule 1d   ==   Rule 1x, Rule 2x, Rule 3y, Rule 4y
Rule 1e   ==   Rule 1y, Rule 2y, Rule 3x, Rule 4y
Rule 1f   ==   Rule 1x, Rule 2y, Rule 3x, Rule 4y
Rule 1g   ==   Rule 1y, Rule 2x, Rule 3x, Rule 4y
Rule 1h   ==   Rule 1x, Rule 2x, Rule 3x, Rule 4y
Rule 1i   ==   Rule 1y, Rule 2y, Rule 3y, Rule 4x
Rule 1j   ==   Rule 1x, Rule 2y, Rule 3y, Rule 4x
Rule 1k   ==   Rule 1y, Rule 2x, Rule 3y, Rule 4x
Rule 1l   ==   Rule 1x, Rule 2x, Rule 3y, Rule 4x
Rule 1m   ==   Rule 1y, Rule 2y, Rule 3x, Rule 4x
Rule 1n   ==   Rule 1x, Rule 2y, Rule 3x, Rule 4x
Rule 1o   ==   Rule 1y, Rule 2x, Rule 3x, Rule 4x
Rule 1p   ==   Rule 1x, Rule 2x, Rule 3x, Rule 4x
In this case, the Wallace Tree TRS reaches its normal form at the end of the first comparison point. We have not shown the verification of the Carry Save Adders (CSAs) as part of this proof: the CSAs are verified separately, and the CSA operator used in Rules 1a to 1p is assumed to be correct. A further advantage of this technique is that it allows the composition of RT-level operators. Such operators can be left uninterpreted in the main proof and verified separately.
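The separate verification of the CSA can be sketched as follows, assuming the usual construction of a word-level CSA from independent full-adder bit slices. Exhaustively checking the 1-bit slice, a + b + c = s + 2·carry, is enough: summing that identity over all bit positions, weighted by 2^i, gives x + y + z = s + c for the whole word. The main proof can then treat the CSA as an uninterpreted operator. The function names are hypothetical.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple:
    """One CSA bit slice: (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (c & a)


def verify_full_adder() -> bool:
    """Exhaustive check of the bit slice: a + b + c == s + 2 * carry."""
    return all(a + b + c == s + 2 * co
               for a, b, c in product((0, 1), repeat=3)
               for s, co in [full_adder(a, b, c)])


def csa(x: int, y: int, z: int, width: int) -> tuple:
    """Word-level CSA built from independent bit slices."""
    s = c = 0
    for i in range(width):
        si, ci = full_adder((x >> i) & 1, (y >> i) & 1, (z >> i) & 1)
        s |= si << i
        c |= ci << (i + 1)          # carries move one position to the left
    return s, c


print(verify_full_adder())   # True
```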
8.1 Results
We present the experimental results obtained from our tool. We produce two sets of results, one on a radix-3 Booth multiplier and another on a Wallace Tree multiplier, and show the time taken by the tool for increasing sizes of these multipliers. We have tried to compare our tool to state-of-the-art equivalence checkers. Since equivalence checkers are most efficient when comparing two gate-level designs, we provided gate-level implementations of the Booth and Wallace Tree designs as their inputs. Although our tool works at the RT level, we compared the gate-level verification times of the equivalence checkers with our tool's times to provide a basis for comparison. Table 8.1 and Table 8.2 show that Commercial Equivalence Checker 1 and Commercial Equivalence Checker 2 both verify the smaller multipliers in time comparable to our tool. However, for the larger multipliers, the equivalence checkers do not run to completion, whereas our tool verifies the design in 24 seconds. It can also be seen that as the sizes increase, the time taken by our tool scales linearly with the size of the design.
Booth Multiplier
Commercial Tool 1    Commercial Tool 2
12s                  9s
20s                  16s
not completed        not completed
not completed        not completed
-
Table 8.1: Comparison of execution times of Verifire against two commercial equivalence checkers for a Booth multiplier of varying sizes. In each case, the golden model was a Shift-and-Add multiplier of the corresponding size.

Commercial Tool 1    Commercial Tool 2
10s                  9s
18s                  16s
not completed        not completed
not completed        not completed
-
Table 8.2: Comparison of execution times of Verifire against two commercial equivalence checkers for a Wallace Tree multiplier of varying sizes. In each case, the golden model was a Shift-and-Add multiplier of the corresponding size.

To help Commercial Equivalence Checker 1 compare the RTL designs, we tried providing comparison points in the multiplier designs. These intermediate comparison points were the partial products obtained in the two multipliers. The results of this experiment are displayed in Table 8.3. When assisted manually with comparison points, the commercial equivalence checker runs to completion for the first three multiplier sizes in Table 8.3, but not for the larger ones. Note that we provided manual assistance with comparison points to the equivalence checker, whereas our tool generates these comparison points automatically.

Verifire (Booth)   Commercial Tool (Booth)   Verifire (Wallace)   Commercial Tool (Wallace)
16s                12s                       14s                  10s
19s                20s                       18s                  20s
24s                1942s                     25s                  972s
37s                not completed             40s                  not completed
53s                                          60s                  -
Table 8.3: Comparison of execution times of Verifire against one commercial equivalence checker assisted by manual comparison points. Results are shown for both Booth and Wallace Tree multipliers.

Our tool is effective in verifying multiplier designs that are modifications (usually optimizations) of standard multipliers such as Booth, Wallace Tree and Array multipliers. We used the tool to verify the Verilog implementation of BISMUL [30], a complicated, modified Booth multiplier. In this case, a Booth multiplier verified by our technique was used as the golden design, and BISMUL was the target design to be verified. Our tool caught a bug in the Verilog code, which appeared while the tool was calculating the partial products after the first comparison point. The symbolic expressions obtained after rewriting for the observed output variable, product (P), could not be proved equal by Vprover at the next comparison point. The rule correspondence that the tool had established, together with the previous comparison point, provided an error trace.
8.2 Discussion
We now discuss why our technique can handle large designs that existing techniques cannot, as shown in our experimental results. Our technique is most powerful in the context of multiplier verification: our tool can efficiently check the equivalence of two different RT-level multiplier designs of any width. Equivalence checkers that use BDD-based algorithms [25] cannot handle large multipliers, whereas our tool scales gracefully to large, complex multipliers. This is partly due to the efficiency afforded by representing circuits at the level of terms rather than at the Boolean level used by BDD-based techniques, and partly due to our ability to decompose large monolithic designs. BDDs, which are widely used for verification in equivalence checking, represent circuits at the Boolean function level. This representation is canonical, and any comparison of two BDDs implies an exhaustive check of Boolean formulae, which can become unmanageable for complex formulae. We instead capture the system behavior as terms, which encapsulate the functionality of the circuit at a block or module level. It is therefore easy and intuitive to decompose the terms into smaller subterms, making the comparison problem more tractable. A principal reason for the large gains of our technique is this efficient and effective partitioning of the problem, together with the automatic computation of comparison points. Unlike BDDs, the terms need not be compared only in their normal (canonical) form. They can be decomposed into smaller subterms that are compared at intermediate points, and our technique computes these intermediate points automatically and efficiently. The simplification process that performs expression equivalence, although not complete, is extremely efficient for large designs such as multipliers. Term rewriting thus lets verification scale gracefully to large, complex arithmetic circuits. The tradeoff is that the set of rewriting heuristics cannot be complete [10], so the rewrite engine may have to be extended with additional reductions. However, for most practical designs, the heuristics needed for rewriting are already part of our rewriter, Vprover.
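The tradeoff described above can be made concrete with a minimal sketch of a rule-database simplifier. This is not Vprover's reduce(). It is a hypothetical rewriter with a deliberately tiny rule set that normalizes terms innermost-first until no rule applies. Any equivalence that needs a missing rule, such as commutativity of xor, is left unproved. This is the kind of case in which the rule database must be extended.

```python
from typing import Callable, List, Optional, Union

# A term is a leaf (variable name or constant) or a tuple (operator, arg1, arg2).
Term = Union[str, int, tuple]
Rule = Callable[[Term], Optional[Term]]


def xor_self(t: Term) -> Optional[Term]:
    """x ^ x -> 0"""
    if isinstance(t, tuple) and t[0] == "xor" and t[1] == t[2]:
        return 0
    return None


def xor_zero(t: Term) -> Optional[Term]:
    """x ^ 0 -> x and 0 ^ x -> x"""
    if isinstance(t, tuple) and t[0] == "xor" and 0 in (t[1], t[2]):
        return t[2] if t[1] == 0 else t[1]
    return None


def and_zero(t: Term) -> Optional[Term]:
    """x & 0 -> 0 and 0 & x -> 0"""
    if isinstance(t, tuple) and t[0] == "and" and 0 in (t[1], t[2]):
        return 0
    return None


RULES: List[Rule] = [xor_self, xor_zero, and_zero]


def reduce_term(t: Term, rules: List[Rule] = RULES) -> Term:
    """Rewrite innermost subterms first, then retry the rule database on the
    result, until no rule applies (a normal form w.r.t. these rules only)."""
    if isinstance(t, tuple):
        t = (t[0],) + tuple(reduce_term(arg, rules) for arg in t[1:])
    for rule in rules:
        rewritten = rule(t)
        if rewritten is not None:
            return reduce_term(rewritten, rules)
    return t


print(reduce_term(("xor", ("xor", "a", "a"), "b")))   # b
print(reduce_term(("xor", "a", "b")))                 # ('xor', 'a', 'b')
```

Every rule here shrinks the term, so rewriting always terminates. With no commutativity rule, reduce_term leaves ("xor", "a", "b") and ("xor", "b", "a") as two different normal forms.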
Chapter 9 Conclusions
Our tool is dedicated to arithmetic circuit verification, so a comparison to Binary Moment Diagrams (BMDs) [4], a technique established for multiplier verification, is called for. Although BMDs are more effective than other model checking techniques, our technique achieves significantly more because it automatically computes comparison (matching) points, which yields substantial savings. We do not deal with intermediate states in the huge state space of multipliers. Instead, we use structural reductions of their implementations to arrive at the comparison points, which is what allows us to automate this process. A major advantage of our technique is that our tool accepts synthesizable Verilog as its input. Therefore, we do not abstract away implementation details, as many abstraction techniques in higher-level verification do. We have used our technique effectively to verify the entire datapath of microprocessors (adders, shifters, comparators and multipliers). We plan to extend our tool to sequential circuits that include pipelining, so that we can also verify the control paths of microprocessors.

The disadvantage of term rewriting is that the Vprover part of the tool, which implements the reduce() function introduced in Chapter 2, is incomplete. The reduce function uses a database of rules to simplify the expressions it compares, and this database may require additional rules to simplify new expressions. However, this incompleteness is traded for the efficiency of the tool. We have also incorporated a large number of rules needed to simplify the expressions encountered in the circuits we targeted, and in its current state, Vprover is very efficient for practical designs. Our technique is similar in spirit to a directed theorem proving approach, but it requires much less user expertise and ingenuity than theorem provers [7], [17], [18], [23]. Our tool is a dedicated arithmetic circuit checker and can be interfaced with equivalence checkers for arithmetic circuit verification. Another possibility is to integrate our tool with the theorem prover ACL2 [19], so that we can leverage the existing RTL library in ACL2 [9], [22] to add rules to Vprover in a sound manner. Toward this goal, we have implemented our technique in ACL2 for the verification of a RISC pipeline, as described in Appendix A. Our technique is a step toward verifying the equivalence of two generic arithmetic circuits. We have verified a large number of arithmetic circuits, such as adders, shifters and comparators, using our technique. It can also tackle a large part of the multiplier space, including many of the multipliers currently in use.
Appendices
A.1 Project Description
This project involves the implementation of a new verification technique in ACL2 and its application to verifying real designs. There are two parts to the project: (1) verification of the 16-bit arithmetic operations of 74181 ALUs cascaded with a 74182 carry lookahead generator, and (2) verification of a RISC pipeline that uses the 74181 ALU as its execution unit. Section A.2 describes the modeling of the technique in the ACL2 environment. Section A.3 gives the actual ACL2 descriptions used in applying the technique to the 16-bit adder operation of the cascaded 74181 ALUs.
A.2
A.2.1 Using ACL2
The technique outlined in Chapter 2 is applied within the ACL2 environment, with the designs modeled as function definitions in ACL2. As we have seen, the technique performs two types of rewriting. One type is part of the design itself and generates the comparable terms. The other is the rewriting needed to prove expression equivalence at every comparison point. Modeling terms as ACL2 functions incorporates the first, internal type of rewriting. Comparison points are given to ACL2 externally, and an ACL2 lemma is generated to check the equivalence of the two designs at each of these points. The proofs of these lemmas form the second type of rewriting. The main theorem states that expression equivalence holds at every comparison point, and we have used ACL2 to prove it.
A.2.2 The 74181 ALU
The 74181 ALU [15] is a 4-bit ALU that performs a range of arithmetic and logical operations. Since the technique is most effective for arithmetic circuits, we have verified only the arithmetic operations of the 74181 IC: addition, subtraction, increment and decrement. The ACL2 description of the 74181 was obtained from the literature. Each operation is selected by assigning the corresponding values to the mode and select inputs of the 74181. For instance, the addition operation can be invoked by the following function.
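(let* ((outs (f74181 (not cin) a0 a1 a2 a3 b0 b1 b2 b3
                     nil t nil nil t))   ; mode and select inputs for addition
       (out1 (nth 0 outs))
       (out2 (nth 1 outs))
       (out3 (nth 2 outs))
       (out4 (nth 3 outs))
       (out5 (nth 4 outs)))
  ;; the carry input and carry output are inverted,
  ;; matching the active-low carry convention of the 74181
  (list out1 out2 out3 out4 (not out5)))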
A ripple carry adder is used as the reference model. The ripple carry adder is first proved to function correctly against the built-in addition operator (+) in ACL2.
The ACL2 description of the 4-bit ripple carry adder is given below.
(defun rca4bit (cin a0 a1 a2 a3 b0 b1 b2 b3)
  (let* ((state0 (rca1bit cin a0 b0))
         (s0 (car state0))
         (c1 (cadr state0))
         (state1 (rca1bit c1 a1 b1))
         (s1 (car state1))
         (c2 (cadr state1))
         (state2 (rca1bit c2 a2 b2))
         (s2 (car state2))
         (c3 (cadr state2))
         (state3 (rca1bit c3 a3 b3))
         (s3 (car state3))
         (cout (cadr state3)))
    (list s0 s1 s2 s3 cout)))
The 74181 can be cascaded using the SN74182 carry lookahead generator to form a 16-bit ALU. The verification of the 16-bit adder operation of the 74181 is described in the next section.
A.3
(defun 16bit74181 (cin a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
                   b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15)
  (let* ((P0 (nth 5 (f74181-adder cin a0 a1 a2 a3 b0 b1 b2 b3)))
         (P1 (nth 5 (f74181-adder nil a4 a5 a6 a7 b4 b5 b6 b7)))
         (P2 (nth 5 (f74181-adder nil a8 a9 a10 a11 b8 b9 b10 b11)))
         (P3 (nth 5 (f74181-adder nil a12 a13 a14 a15 b12 b13 b14 b15)))
         (G0 (nth 6 (f74181-adder cin a0 a1 a2 a3 b0 b1 b2 b3)))
         (G1 (nth 6 (f74181-adder nil a4 a5 a6 a7 b4 b5 b6 b7)))
         (G2 (nth 6 (f74181-adder nil a8 a9 a10 a11 b8 b9 b10 b11)))
         (G3 (nth 6 (f74181-adder nil a12 a13 a14 a15 b12 b13 b14 b15)))
         (pglist (list cin P0 P1 P2 P3 G0 G1 G2 G3))
         (carrylist (sn74182 pglist))
         (c3 (nth 0 carrylist))
         (c7 (nth 1 carrylist))
         (c11 (nth 2 carrylist))
         (adder1 (f74181-adder cin a0 a1 a2 a3 b0 b1 b2 b3))
         (adder2 (f74181-adder c3 a4 a5 a6 a7 b4 b5 b6 b7))
         (adder3 (f74181-adder c7 a8 a9 a10 a11 b8 b9 b10 b11))
         (adder4 (f74181-adder c11 a12 a13 a14 a15 b12 b13 b14 b15)))
    (list (nth 0 adder1) (nth 1 adder1) (nth 2 adder1) (nth 3 adder1)
          (nth 0 adder2) (nth 1 adder2) (nth 2 adder2) (nth 3 adder2)
          (nth 0 adder3) (nth 1 adder3) (nth 2 adder3) (nth 3 adder3)
          (nth 0 adder4) (nth 1 adder4) (nth 2 adder4) (nth 3 adder4)
          (nth 4 adder4))))
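The carry-lookahead idea behind 16bit74181 can be sketched in Python at a more abstract level. This is an active-high simplification: block generate and propagate are computed arithmetically (a block generates when its 4-bit sum exceeds 15 and propagates when it equals 15), whereas the real 74181 and SN74182 use active-low P and G pins and their own pin ordering. The function names are hypothetical.

```python
def block_pg(a, b):
    """Generate/propagate of one 4-bit block; independent of the block's carry-in."""
    s = a + b
    return s > 0xF, s == 0xF


def lookahead(c0, p, g):
    """Carries into blocks 1..3 as flat sums of products (no rippling)."""
    c4 = g[0] or (p[0] and c0)
    c8 = g[1] or (p[1] and g[0]) or (p[1] and p[0] and c0)
    c12 = (g[2] or (p[2] and g[1]) or (p[2] and p[1] and g[0])
           or (p[2] and p[1] and p[0] and c0))
    return c4, c8, c12


def add16_lookahead(cin, a, b):
    """16-bit a + b + cin using four 4-bit blocks and lookahead carries."""
    a_blocks = [(a >> (4 * i)) & 0xF for i in range(4)]
    b_blocks = [(b >> (4 * i)) & 0xF for i in range(4)]
    g, p = zip(*(block_pg(x, y) for x, y in zip(a_blocks, b_blocks)))
    carries = (cin,) + lookahead(cin, p, g)
    total, cout = 0, False
    for i in range(4):
        s = a_blocks[i] + b_blocks[i] + int(carries[i])
        total |= (s & 0xF) << (4 * i)
        cout = s > 0xF  # after the loop: carry out of the most significant block
    return total, cout
```

As in the ACL2 code, block P and G do not depend on the block's carry-in, which is why the ACL2 definition can pass nil as the carry-in when computing P1 to P3 and G1 to G3.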
The observation function for the golden (reference) model is defined as follows. In the case of the adder, the observation at each comparison point is the sum output.
(defun obs_g (m cin a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
               b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15)
  (let* ((outputs (rca16bit cin a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
                            b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15))
         (out0 (nth 0 outputs)) (out1 (nth 1 outputs))
         (out2 (nth 2 outputs)) (out3 (nth 3 outputs))
         (out4 (nth 4 outputs)) (out5 (nth 5 outputs))
         (out6 (nth 6 outputs)) (out7 (nth 7 outputs))
         (out8 (nth 8 outputs)) (out9 (nth 9 outputs))
         (out10 (nth 10 outputs)) (out11 (nth 11 outputs))
         (out12 (nth 12 outputs)) (out13 (nth 13 outputs))
         (out14 (nth 14 outputs)) (out15 (nth 15 outputs)))
    (cond ((equal m 1) (list out0 out1 out2 out3))
          ((equal m 2) (list out4 out5 out6 out7))
          ((equal m 3) (list out8 out9 out10 out11))
          ((equal m 4) (list out12 out13 out14 out15)))))
Similarly, the observation function for the revised design is defined as follows.
(defun obs_r (m cin a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
               b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15)
  (let* ((outputs (16bit74181 cin a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
                              b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15))
         (out0 (nth 0 outputs)) (out1 (nth 1 outputs))
         (out2 (nth 2 outputs)) (out3 (nth 3 outputs))
         (out4 (nth 4 outputs)) (out5 (nth 5 outputs))
         (out6 (nth 6 outputs)) (out7 (nth 7 outputs))
         (out8 (nth 8 outputs)) (out9 (nth 9 outputs))
         (out10 (nth 10 outputs)) (out11 (nth 11 outputs))
         (out12 (nth 12 outputs)) (out13 (nth 13 outputs))
         (out14 (nth 14 outputs)) (out15 (nth 15 outputs)))
    (cond ((equal m 1) (list out0 out1 out2 out3))
          ((equal m 2) (list out4 out5 out6 out7))
          ((equal m 3) (list out8 out9 out10 out11))
          ((equal m 4) (list out12 out13 out14 out15)))))
The conditional statement on m defines the different comparison points and the observation function at each comparison point. For instance, at the first comparison point, the first four bits of the sum obtained in both systems are compared. The comparison points are provided to the ACL2 proof engine. In the 74181 model, the sum of the first four bits is obtained after a single rewriting step. In the ripple carry model, however, only one bit of the sum is obtained at the end of every rewriting step. The ripple carry adder is therefore stepped four times, and the first four bits of the sum are compared at the first comparison point. At the last comparison point, the normal forms of the two systems are compared, i.e., the last comparison point is at the state where all the bits of the sum have been obtained. The main theorem states that the two TRSs are equivalent if the observation functions are proved equal at every comparison point. We used ACL2 to prove the expression equivalence of the outputs at each comparison point by proving the following lemma, which establishes the equivalence of four bits of the sum outputs at any comparison point.
(defthm 4biteq
  (implies (and (Booleanp cin)
                (Booleanp a0) (Booleanp a1) (Booleanp a2) (Booleanp a3)
                (Booleanp b0) (Booleanp b1) (Booleanp b2) (Booleanp b3))
           (equal (rca4bit cin a0 a1 a2 a3 b0 b1 b2 b3)
                  (f74181-adder cin a0 a1 a2 a3 b0 b1 b2 b3))))
ACL2 proved both of these theorems, thereby establishing the equivalence of the two designs. The other arithmetic operations of the 74181 ALU, namely subtraction, increment, and decrement, were also verified, again using the ripple carry adder model as the reference. The proof procedure is very similar to the one described for addition.
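The relationship between rewriting steps and comparison points can be illustrated with a small Python analogy. Here the generators ripple_steps and block_steps stand in for the two TRSs (one sum bit per step versus one 4-bit block per step), and obs_g and obs_r are hypothetical Python counterparts of the ACL2 functions that operate on integers instead of Boolean argument lists. Comparison point m corresponds to 4m steps of the reference system and m steps of the revised one.

```python
from itertools import islice


def ripple_steps(cin, a, b, width=16):
    """Reference system: yield one sum bit per step, least significant first."""
    carry = cin
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        yield x ^ y ^ carry
        carry = (x & y) | (x & carry) | (y & carry)


def block_steps(cin, a, b, blocks=4):
    """Revised system: yield four sum bits per step, one 4-bit block at a time."""
    carry = cin
    for i in range(blocks):
        s = ((a >> (4 * i)) & 0xF) + ((b >> (4 * i)) & 0xF) + carry
        yield [(s >> k) & 1 for k in range(4)]
        carry = s >> 4


def obs_g(m, cin, a, b):
    """Step the reference 4*m times and observe the last four sum bits."""
    bits = list(islice(ripple_steps(cin, a, b), 4 * m))
    return bits[4 * (m - 1):]


def obs_r(m, cin, a, b):
    """Step the revised system m times and observe the block just produced."""
    return list(islice(block_steps(cin, a, b), m))[m - 1]


def equivalent_at_all_points(cin, a, b):
    """True if the observations agree at comparison points m = 1..4."""
    return all(obs_g(m, cin, a, b) == obs_r(m, cin, a, b) for m in range(1, 5))
```

Testing a few inputs like this only samples the input space; the ACL2 lemma is what covers all Boolean inputs.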
A.4
To illustrate this technique, we are currently working on a simple RISC pipeline that performs four operations per instruction: fetch, decode, execute, and write back. The processor system has an instruction memory im, a register file rf, and a program counter pc. The execution phase calls the 74181 ALU module to execute the operations. The ACL2 definitions for these four operations are:
(defconst *initialrf*
  (list (list 'r1 nil) (list 'r2 nil) (list 'r3 nil) (list 'r4 nil)
        (list 'r5 nil) (list 'r6 nil) (list 'r7 nil) (list 'r8 nil)))
(defun fetch (pc rf im)
  (let* ((ir (car (car rf))))
    (list (+ 1 pc) (list (list ir (nth pc im)) (cdr rf)) im nil)))
(defun decode (pc rf im)
  (let* ((fetchedi (car (cadr (car rf)))))
    (cond ((equal fetchedi 'add) (list pc rf im (list t nil t nil nil t) nil))
          ((equal fetchedi 'sub) (list pc rf im (list nil nil nil t t nil) nil))
          ((equal fetchedi 'inc) (list pc rf im (list nil nil t t t t) nil))
          ((equal fetchedi 'dec)
(defun excecute (pc rf im cntrl)
  (let* ((reg1 (cond ((equal (cadr (cadr (nth 0 rf))) 'r2) (cadr (nth 1 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r3) (cadr (nth 2 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r4) (cadr (nth 3 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r5) (cadr (nth 4 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r6) (cadr (nth 5 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r7) (cadr (nth 6 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r8) (cadr (nth 7 rf)))
                     ((equal (cadr (cadr (nth 0 rf))) 'r9) (cadr (nth 8 rf)))))
         (reg2 (cond ((equal (caddr (cadr (nth 0 rf))) 'r2) (cdr (nth 1 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r3) (cdr (nth 2 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r4) (cdr (nth 3 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r5) (cdr (nth 4 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r6) (cdr (nth 5 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r7) (cdr (nth 6 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r8) (cdr (nth 7 rf)))
                     ((equal (caddr (cadr (nth 0 rf))) 'r9) (cdr (nth 8 rf))))))
    (list pc rf im
          (f74181 (nth 0 cntrl)
                  (nth 0 reg1) (nth 1 reg1) (nth 2 reg1) (nth 3 reg1)
                  (nth 0 reg2) (nth 1 reg2) (nth 2 reg2) (nth 3 reg2)
                  (nth 1 cntrl) (nth 2 cntrl) (nth 3 cntrl) (nth 4 cntrl) (nth 5 cntrl)))))
         (out0 (nth 0 retval))
         (out1 (nth 1 retval))
         (out2 (nth 2 retval))
         (out3 (nth 3 retval)))
    (put-assoc-eq reg2 (list out0 out1 out2 out3) rf)))
The reference design in this case can be a non-pipelined machine that takes four machine cycles to process a single instruction. The comparison point for the two machines is reached after the same number of instructions, n, has been executed in both systems. This takes 4n machine cycles in the non-pipelined system, as opposed to n + 3 cycles in the pipelined system. We are working on the proof of equivalence of these two machines.
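The cycle counts at the comparison point can be sketched as follows, assuming an idealised machine in which each of the four stages takes exactly one cycle and the pipeline never stalls (no hazards or branches). Under those assumptions, n instructions take 4n cycles without pipelining and n + 3 cycles with it, since the last instruction enters the pipeline at cycle n - 1 and needs four more cycles to finish. The function names are illustrative only.

```python
STAGES = ("fetch", "decode", "execute", "writeback")


def nonpipelined_cycles(n, stages=len(STAGES)):
    """Each instruction runs through all stages before the next one starts."""
    return n * stages


def pipelined_cycles(n, stages=len(STAGES)):
    """Instruction i (0-based) enters at cycle i and finishes `stages` cycles later."""
    if n == 0:
        return 0
    return max(i + stages for i in range(n))


def schedule(n, stages=STAGES):
    """Map each cycle to the (instruction, stage) pairs active in that cycle."""
    table = {}
    for i in range(n):
        for k, stage in enumerate(stages):
            table.setdefault(i + k, []).append((i, stage))
    return table
```

For example, schedule(2)[1] shows instruction 0 in decode while instruction 1 is being fetched, which is exactly the overlap that makes the two machines reach the comparison point after different numbers of cycles.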
The shfadd_16bit module is the main module, which in turn refers to a 32-bit full adder (serialadd_32bit) and a 32-bit multiplexer (mux_32bit).
module fulladder (a, b, c, x, y);
input a, b, c;
output x, y;
wire w_x, w_y;
// Sum: w_x = a XOR b XOR c, built from NOT and NAND gates
not (a_, a);
not (b_, b);
nand (an, a_, b);
nand (bn, a, b_);
nand (axb, an, bn);
not (c_, c);
not (axb_, axb);
nand (axbn, axb_, c);
nand (cn, c_, axb);
nand (w_x, axbn, cn);
// Carry: w_y = ab + ac + bc (majority of a, b, c)
nand (anb, a, b);
nand (anc, a, c);
nand (bnc, b, c);
and (anbanc, anb, anc);
nand (w_y, anbanc, bnc);
assign x = w_x;
assign y = w_y;
endmodule // fulladder
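Because the sum and carry of fulladder are built only from NOT, NAND and AND gates, it is easy to lose track of what each net computes. The Python sketch below evaluates the netlist gate by gate, using the same net names, and compares it with a + b + c on all eight input combinations. It assumes that the outputs x and y carry w_x and w_y.

```python
def nand(*inputs):
    """NAND of 0/1 inputs."""
    return 0 if all(inputs) else 1


def inv(x):
    """Inverter (the Verilog `not` gate)."""
    return 1 - x


def fulladder(a, b, c):
    """Evaluate the fulladder netlist; returns (w_x, w_y) = (sum, carry)."""
    a_, b_ = inv(a), inv(b)
    an, bn = nand(a_, b), nand(a, b_)
    axb = nand(an, bn)                    # a XOR b
    c_, axb_ = inv(c), inv(axb)
    axbn, cn = nand(axb_, c), nand(c_, axb)
    w_x = nand(axbn, cn)                  # (a XOR b) XOR c
    anb, anc, bnc = nand(a, b), nand(a, c), nand(b, c)
    anbanc = anb & anc                    # the `and` gate
    w_y = nand(anbanc, bnc)               # ab + ac + bc
    return w_x, w_y


def check_fulladder():
    """True if 2 * carry + sum equals a + b + c for every input combination."""
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                x, y = fulladder(a, b, c)
                if x + 2 * y != a + b + c:
                    return False
    return True
```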
output cout;
wire [30:0] carry;
fulladder (a[0], b[0], cin, s[0], carry[0]);
fulladder (a[1], b[1], carry[0], s[1], carry[1]);
fulladder (a[2], b[2], carry[1], s[2], carry[2]);
fulladder (a[3], b[3], carry[2], s[3], carry[3]);
fulladder (a[4], b[4], carry[3], s[4], carry[4]);
fulladder (a[5], b[5], carry[4], s[5], carry[5]);
fulladder (a[6], b[6], carry[5], s[6], carry[6]);
fulladder (a[7], b[7], carry[6], s[7], carry[7]);
fulladder (a[8], b[8], carry[7], s[8], carry[8]);
fulladder (a[9], b[9], carry[8], s[9], carry[9]);
fulladder (a[10], b[10], carry[9], s[10], carry[10]);
fulladder (a[11], b[11], carry[10], s[11], carry[11]);
fulladder (a[12], b[12], carry[11], s[12], carry[12]);
fulladder (a[13], b[13], carry[12], s[13], carry[13]);
fulladder (a[14], b[14], carry[13], s[14], carry[14]);
fulladder (a[15], b[15], carry[14], s[15], carry[15]);
fulladder (a[16], b[16], carry[15], s[16], carry[16]);
fulladder (a[17], b[17], carry[16], s[17], carry[17]);
fulladder (a[18], b[18], carry[17], s[18], carry[18]);
fulladder (a[19], b[19], carry[18], s[19], carry[19]);
fulladder (a[20], b[20], carry[19], s[20], carry[20]);
fulladder (a[21], b[21], carry[20], s[21], carry[21]);
fulladder (a[22], b[22], carry[21], s[22], carry[22]);
fulladder (a[23], b[23], carry[22], s[23], carry[23]);
fulladder (a[24], b[24], carry[23], s[24], carry[24]);
fulladder (a[25], b[25], carry[24], s[25], carry[25]);
fulladder (a[26], b[26], carry[25], s[26], carry[26]);
fulladder (a[27], b[27], carry[26], s[27], carry[27]);
fulladder (a[28], b[28], carry[27], s[28], carry[28]);
fulladder (a[29], b[29], carry[28], s[29], carry[29]);
fulladder (a[30], b[30], carry[29], s[30], carry[30]);
fulladder (a[31], b[31], carry[30], s[31], cout);
endmodule // serialadd_32bit
not (s_n, s);
and (as, a, s);
and (bs, b, s_n);
or (o, as, bs);
endmodule // mux2to1
mux2to1 (in[0], 1'b0, select, out[0]);
mux2to1 (in[1], 1'b0, select, out[1]);
mux2to1 (in[2], 1'b0, select, out[2]);
mux2to1 (in[3], 1'b0, select, out[3]);
mux2to1 (in[4], 1'b0, select, out[4]);
mux2to1 (in[5], 1'b0, select, out[5]);
mux2to1 (in[6], 1'b0, select, out[6]);
mux2to1 (in[7], 1'b0, select, out[7]);
mux2to1 (in[8], 1'b0, select, out[8]);
mux2to1 (in[9], 1'b0, select, out[9]);
mux2to1 (in[10], 1'b0, select, out[10]);
mux2to1 (in[11], 1'b0, select, out[11]);
mux2to1 (in[12], 1'b0, select, out[12]);
mux2to1 (in[13], 1'b0, select, out[13]);
mux2to1 (in[14], 1'b0, select, out[14]);
mux2to1 (in[15], 1'b0, select, out[15]);
mux2to1 (in[16], 1'b0, select, out[16]);
mux2to1 (in[17], 1'b0, select, out[17]);
mux2to1 (in[18], 1'b0, select, out[18]);
mux2to1 (in[19], 1'b0, select, out[19]);
mux2to1 (in[20], 1'b0, select, out[20]);
mux2to1 (in[21], 1'b0, select, out[21]);
mux2to1 (in[22], 1'b0, select, out[22]);
mux2to1 (in[23], 1'b0, select, out[23]);
mux2to1 (in[24], 1'b0, select, out[24]);
mux2to1 (in[25], 1'b0, select, out[25]);
mux2to1 (in[26], 1'b0, select, out[26]);
mux2to1 (in[27], 1'b0, select, out[27]);
mux2to1 (in[28], 1'b0, select, out[28]);
mux2to1 (in[29], 1'b0, select, out[29]);
mux2to1 (in[30], 1'b0, select, out[30]);
mux2to1 (in[31], 1'b0, select, out[31]);
endmodule // mux_32bit
output [31:0] P;
wire [31:0] p_0, p_1, p_2, p_3, p_4, p_5, p_6, p_7;
wire [31:0] p_8, p_9, p_10, p_11, p_12, p_13, p_14, p_15;
wire cout;
wire [31:0] pp1, pp2, pp3, pp4, pp5, pp6, pp7, pp8;
wire [31:0] pp9, pp10, pp11, pp12, pp13, pp14, pp15;
assign P = p_15;
mux_32bit (A[1], {15'b0, B, 1'b0}, pp1); serialadd_32bit (p_0, pp1, 1'b0, p_1, cout);
mux_32bit (A[2], {14'b0, B, 2'b0}, pp2); serialadd_32bit (p_1, pp2, 1'b0, p_2, cout);
mux_32bit (A[3], {13'b0, B, 3'b0}, pp3); serialadd_32bit (p_2, pp3, 1'b0, p_3, cout);
mux_32bit (A[4], {12'b0, B, 4'b0}, pp4); serialadd_32bit (p_3, pp4, 1'b0, p_4, cout);
mux_32bit (A[5], {11'b0, B, 5'b0}, pp5); serialadd_32bit (p_4, pp5, 1'b0, p_5, cout);
mux_32bit (A[6], {10'b0, B, 6'b0}, pp6); serialadd_32bit (p_5, pp6, 1'b0, p_6, cout);
mux_32bit (A[7], {9'b0, B, 7'b0}, pp7); serialadd_32bit (p_6, pp7, 1'b0, p_7, cout);
mux_32bit (A[8], {8'b0, B, 8'b0}, pp8); serialadd_32bit (p_7, pp8, 1'b0, p_8, cout);
mux_32bit (A[9], {7'b0, B, 9'b0}, pp9); serialadd_32bit (p_8, pp9, 1'b0, p_9, cout);
mux_32bit (A[10], {6'b0, B, 10'b0}, pp10); serialadd_32bit (p_9, pp10, 1'b0, p_10, cout);
mux_32bit (A[11], {5'b0, B, 11'b0}, pp11); serialadd_32bit (p_10, pp11, 1'b0, p_11, cout);
mux_32bit (A[12], {4'b0, B, 12'b0}, pp12); serialadd_32bit (p_11, pp12, 1'b0, p_12, cout);
mux_32bit (A[13], {3'b0, B, 13'b0}, pp13); serialadd_32bit (p_12, pp13, 1'b0, p_13, cout);
mux_32bit (A[14], {2'b0, B, 14'b0}, pp14); serialadd_32bit (p_13, pp14, 1'b0, p_14, cout);
mux_32bit (A[15], {1'b0, B, 15'b0}, pp15); serialadd_32bit (p_14, pp15, 1'b0, p_15, cout);
endmodule // shfadd_16bit
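Behaviourally, shfadd_16bit accumulates one shifted, gated copy of B per bit of A. A minimal Python sketch of that data flow follows; it assumes the first partial product p_0 is A[0] ? B : 0, because the line producing p_0 does not appear in the listing, and it models serialadd_32bit as addition modulo 2**32. The function names mirror the Verilog modules but are only illustrative.

```python
MASK32 = (1 << 32) - 1


def mux_32bit(select, value):
    """Like the Verilog mux_32bit: pass value when select is 1, else 0."""
    return value if select else 0


def shfadd_16bit(a, b):
    """Shift-and-add product of two 16-bit values, truncated to 32 bits."""
    p = mux_32bit(a & 1, b)  # assumed p_0 = A[0] ? B : 0
    for i in range(1, 16):
        pp = mux_32bit((a >> i) & 1, (b << i) & MASK32)  # B shifted left by i
        p = (p + pp) & MASK32  # serialadd_32bit with carry-in 0
    return p
```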
The Verilog code for a 16-bit Booth multiplier is described here. The booth_16bit module is the main module, which in turn refers to a 32-bit 8-way multiplexer (mux8way_32bit) and a 32-bit full adder (serialadd_32bit). The full adder code is the same as in Appendix B.
module ppgen (m, pp0, pp1, pp2, pp3, pp4, pp5, pp6, pp7);
input [31:0] m;
output [31:0] pp0, pp1, pp2, pp3, pp4, pp5, pp6, pp7;
//pp0
//pp1 //pp2
//pp4
assign assign
//pp6 //pp7
serialadd_32bit ({m[30:0],1'b0}, m, 1'b0, p3, c); // 3m = 2m + m
serialadd_32bit ({m[29:0],2'b0}, m, 1'b0, p5, c); // 5m = 4m + m
serialadd_32bit ({m[30:0],1'b0}, {m[29:0],2'b0}, 1'b0, p6, c); // 6m = 2m + 4m
serialsub_32bit ({m[28:0],3'b0}, m, p7, c); // 7m = 8m - m
endmodule // ppgen
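ppgen produces the eight radix-8 multiples of m. The even multiples 2m and 4m are shifts, while 3m, 5m, 6m and 7m use the adder and subtractor instances shown. The Python sketch below checks that scheme modulo 2**32; the values of pp0, pp1, pp2 and pp4 (0, m, 2m and 4m) are assumptions, since their assign statements are garbled in the listing, and serialsub_32bit is assumed to compute its first operand minus its second.

```python
MASK32 = (1 << 32) - 1


def ppgen(m):
    """Return [0*m, 1*m, ..., 7*m] modulo 2**32, built from shifts and adds."""
    m &= MASK32
    two_m = (m << 1) & MASK32      # {m[30:0], 1'b0}
    four_m = (m << 2) & MASK32     # {m[29:0], 2'b0}
    eight_m = (m << 3) & MASK32    # {m[28:0], 3'b0}
    p3 = (two_m + m) & MASK32      # serialadd_32bit -> p3
    p5 = (four_m + m) & MASK32     # serialadd_32bit -> p5
    p6 = (two_m + four_m) & MASK32 # serialadd_32bit -> p6
    p7 = (eight_m - m) & MASK32    # serialsub_32bit -> p7 (assumed a - b)
    return [0, m, two_m, p3, four_m, p5, p6, p7]
```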
module mux8way (select, p0, p1, p2, p3, p4, p5, p6, p7, po);
input [2:0] select;
input p0, p1, p2, p3, p4, p5, p6, p7;
output po;
wire po_10, po_11, po_12, po_13;
mux2to1 (p1, p0, select[0], po_10);
mux2to1 (p3, p2, select[0], po_11);
mux2to1 (p5, p4, select[0], po_12);
mux2to1 (p7, p6, select[0], po_13);
endmodule // mux8way
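Only the first level of mux8way survives in the listing. A minimal Python sketch of a full three-level tree of 2:1 multiplexers is given below; the second and third levels (driven by select[1] and select[2]) are assumptions about the missing lines, and mux2to1 follows the convention of the Verilog mux2to1, which outputs its first input when the select is 1.

```python
def mux2to1(a, b, s):
    """Same convention as the Verilog mux2to1: output a when s is 1, else b."""
    return a if s else b


def mux8way(select, p):
    """Select p[select] (select in 0..7) through three levels of 2:1 muxes."""
    s0, s1, s2 = select & 1, (select >> 1) & 1, (select >> 2) & 1
    # Level 1 matches the listing: po_10..po_13 choose within each pair.
    level1 = [mux2to1(p[2 * i + 1], p[2 * i], s0) for i in range(4)]
    # Levels 2 and 3 are assumed, since those lines are not shown.
    level2 = [mux2to1(level1[2 * i + 1], level1[2 * i], s1) for i in range(2)]
    return mux2to1(level2[1], level2[0], s2)
```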
module mux8way_32bit (select, pp0, pp1, pp2, pp3, pp4, pp5, pp6, pp7, ppout);
input [2:0] select;
input [31:0] pp0, pp1, pp2, pp3, pp4, pp5, pp6, pp7;
output [31:0] ppout;
mux8way (select, pp0[0], pp1[0], pp2[0], pp3[0], pp4[0], pp5[0], pp6[0], pp7[0], ppout[0]);
mux8way (select, pp0[1], pp1[1], pp2[1], pp3[1], pp4[1], pp5[1], pp6[1], pp7[1], ppout[1]);
mux8way (select, pp0[2], pp1[2], pp2[2], pp3[2], pp4[2], pp5[2], pp6[2], pp7[2], ppout[2]);
mux8way (select, pp0[3], pp1[3], pp2[3], pp3[3], pp4[3], pp5[3], pp6[3], pp7[3], ppout[3]);
mux8way (select, pp0[4], pp1[4], pp2[4], pp3[4], pp4[4], pp5[4], pp6[4], pp7[4], ppout[4]);
mux8way (select, pp0[5], pp1[5], pp2[5], pp3[5], pp4[5], pp5[5], pp6[5], pp7[5], ppout[5]);
mux8way (select, pp0[6], pp1[6], pp2[6], pp3[6], pp4[6], pp5[6], pp6[6], pp7[6], ppout[6]);
mux8way (select, pp0[7], pp1[7], pp2[7], pp3[7], pp4[7], pp5[7], pp6[7], pp7[7], ppout[7]);
mux8way (select, pp0[8], pp1[8], pp2[8], pp3[8], pp4[8], pp5[8], pp6[8], pp7[8], ppout[8]);
mux8way (select, pp0[9], pp1[9], pp2[9], pp3[9], pp4[9], pp5[9], pp6[9], pp7[9], ppout[9]);
mux8way (select, pp0[10], pp1[10], pp2[10], pp3[10], pp4[10], pp5[10], pp6[10], pp7[10], ppout[10]);
mux8way (select, pp0[11], pp1[11], pp2[11], pp3[11], pp4[11], pp5[11], pp6[11], pp7[11], ppout[11]);
mux8way (select, pp0[12], pp1[12], pp2[12], pp3[12], pp4[12], pp5[12], pp6[12], pp7[12], ppout[12]);
mux8way (select, pp0[13], pp1[13], pp2[13], pp3[13], pp4[13], pp5[13], pp6[13], pp7[13], ppout[13]);
mux8way (select, pp0[14], pp1[14], pp2[14], pp3[14], pp4[14], pp5[14], pp6[14], pp7[14], ppout[14]);
mux8way (select, pp0[15], pp1[15], pp2[15], pp3[15], pp4[15], pp5[15], pp6[15], pp7[15], ppout[15]);
mux8way (select, pp0[16], pp1[16], pp2[16], pp3[16], pp4[16], pp5[16], pp6[16], pp7[16], ppout[16]);
mux8way (select, pp0[17], pp1[17], pp2[17], pp3[17], pp4[17], pp5[17], pp6[17], pp7[17], ppout[17]);
mux8way (select, pp0[18], pp1[18], pp2[18], pp3[18], pp4[18], pp5[18], pp6[18], pp7[18], ppout[18]);
mux8way (select, pp0[19], pp1[19], pp2[19], pp3[19], pp4[19], pp5[19], pp6[19], pp7[19], ppout[19]);
mux8way (select, pp0[20], pp1[20], pp2[20], pp3[20], pp4[20], pp5[20], pp6[20], pp7[20], ppout[20]);
mux8way (select, pp0[21], pp1[21], pp2[21], pp3[21], pp4[21], pp5[21], pp6[21], pp7[21], ppout[21]);
mux8way (select, pp0[22], pp1[22], pp2[22], pp3[22], pp4[22], pp5[22], pp6[22], pp7[22], ppout[22]);
mux8way (select, pp0[23], pp1[23], pp2[23], pp3[23], pp4[23], pp5[23], pp6[23], pp7[23], ppout[23]);
mux8way (select, pp0[24], pp1[24], pp2[24], pp3[24], pp4[24], pp5[24], pp6[24], pp7[24], ppout[24]);
mux8way (select, pp0[25], pp1[25], pp2[25], pp3[25], pp4[25], pp5[25], pp6[25], pp7[25], ppout[25]);
mux8way (select, pp0[26], pp1[26], pp2[26], pp3[26], pp4[26], pp5[26], pp6[26], pp7[26], ppout[26]);
mux8way (select, pp0[27], pp1[27], pp2[27], pp3[27], pp4[27], pp5[27], pp6[27], pp7[27], ppout[27]);
mux8way (select, pp0[28], pp1[28], pp2[28], pp3[28], pp4[28], pp5[28], pp6[28], pp7[28], ppout[28]);
mux8way (select, pp0[29], pp1[29], pp2[29], pp3[29], pp4[29], pp5[29], pp6[29], pp7[29], ppout[29]);
mux8way (select, pp0[30], pp1[30], pp2[30], pp3[30], pp4[30], pp5[30], pp6[30], pp7[30], ppout[30]);
mux8way (select, pp0[31], pp1[31], pp2[31], pp3[31], pp4[31], pp5[31], pp6[31], pp7[31], ppout[31]);
endmodule // mux8way_32bit
wire [31:0] pp10, pp11, pp12, pp13, pp14, pp15, pp16, pp17;
wire [31:0] pp20, pp21, pp22, pp23, pp24, pp25, pp26, pp27;
wire [31:0] pp30, pp31, pp32, pp33, pp34, pp35, pp36, pp37;
wire [31:0] pp40, pp41, pp42, pp43, pp44, pp45, pp46, pp47;
wire [31:0] pp50, pp51, pp52, pp53, pp54, pp55, pp56, pp57;
wire [31:0] pp60, pp61, pp62, pp63, pp64, pp65, pp66, pp67;
wire [31:0] p, pp, pp1, pp2, pp3, ppout1, ppout2;
wire [31:0] ppout3, ppout4, ppout5, ppout6;
assign P = p;
ppgen ({16'b0,A}, pp10, pp11, pp12, pp13, pp14, pp15, pp16, pp17);
mux8way_32bit (B[2:0], pp10, pp11, pp12, pp13, pp14, pp15, pp16, pp17, ppout1);
ppgen ({13'b0,A,3'b0}, pp20, pp21, pp22, pp23, pp24, pp25, pp26, pp27);
mux8way_32bit (B[5:3], pp20, pp21, pp22, pp23, pp24, pp25, pp26, pp27, ppout2);
serialadd_32bit (ppout1, ppout2, 1'b0, pp, c);
ppgen ({10'b0,A,6'b0}, pp30, pp31, pp32, pp33, pp34, pp35, pp36, pp37);
mux8way_32bit (B[8:6], pp30, pp31, pp32, pp33, pp34, pp35, pp36, pp37, ppout3);
serialadd_32bit (pp, ppout3, 1'b0, pp1, c);
ppgen ({7'b0,A,9'b0}, pp40, pp41, pp42, pp43, pp44, pp45, pp46, pp47);
mux8way_32bit (B[11:9], pp40, pp41, pp42, pp43, pp44, pp45, pp46, pp47, ppout4);
serialadd_32bit (pp1, ppout4, 1'b0, pp2, c);
ppgen ({4'b0,A,12'b0}, pp50, pp51, pp52, pp53, pp54, pp55, pp56, pp57);
mux8way_32bit (B[14:12], pp50, pp51, pp52, pp53, pp54, pp55, pp56, pp57, ppout5);
serialadd_32bit (pp2, ppout5, 1'b0, pp3, c);
ppgen ({1'b0,A,15'b0}, pp60, pp61, pp62, pp63, pp64, pp65, pp66, pp67);
mux8way_32bit ({2'b0,B[15]}, pp60, pp61, pp62, pp63, pp64, pp65, pp66, pp67, ppout6);
serialadd_32bit (pp3, ppout6, 1'b0, p, c);
endmodule // booth_16bit
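The overall data flow of booth_16bit can be summarised in a short Python sketch: B is split into 3-bit digits, each digit selects one of the multiples 0 to 7 times A provided by ppgen (shifted by three bits per digit), and the selected partial products are summed modulo 2**32. The sketch computes the multiples directly rather than through ppgen and mux8way_32bit, and its function names are hypothetical.

```python
MASK32 = (1 << 32) - 1


def radix8_digits(b):
    """Unsigned radix-8 digits of a 16-bit value, least significant first.

    Digits are B[2:0], B[5:3], ..., B[14:12], and finally B[15] on its own.
    """
    return [(b >> (3 * i)) & 0b111 for i in range(6)]


def booth_16bit(a, b):
    """Sum of digit * (A << 3i) over the digits of B, modulo 2**32."""
    product = 0
    for i, digit in enumerate(radix8_digits(b)):
        partial = (digit * (a << (3 * i))) & MASK32  # role of ppgen + mux8way_32bit
        product = (product + partial) & MASK32       # role of serialadd_32bit
    return product
```

Handling three multiplier bits per partial product is what reduces the sixteen additions of shfadd_16bit to the five serialadd_32bit instances in this module.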
is the main multiplier module. In turn, it refers to a 73-bit carry save adder (csa_5op$), a 16-bit multiplexer along with shift by 1 bit (mplier_shft16b$), and a partial product generator (ppreg$).
module ppreg$ (out, mul, iclk);
input [63:0] mul;
input iclk;
output [669:0] out;
wire set0=1'b0;
wire set1=1'b1;
wire [66:0] pp0, pp1, pp2, pp3, pp4, pp5, pp6, pp7;
wire [66:0] x3mul, x5mul, x7mul, pp8, ppn1;
wire [66:0] p4_temp;
assign pp3 = out[267:201];
assign pp4 = out[334:268];
assign pp5 = out[401:335];
assign pp6 = out[468:402];
assign pp7 = out[535:469];
assign pp8 = out[602:536];
assign ppn1 = out[669:603];
samp_hold67$ sh1 (out[133:67], {3'b0, mul}, iclk);
samp_hold67$ sh2 (out[200:134], {2'b0, mul, 1'b0}, iclk);
latch67$ lat3 (out[267:201], x3mul, iclk);
samp_hold67$ sh4 (out[334:268], {1'b0, mul, 2'b0}, iclk);
latch67$ lat5 (out[401:335], x5mul, iclk);
latch67$ lat6 (out[468:402], {x3mul[65:0],1'b0}, iclk);
latch67$ lat7 (out[535:469], x7mul, iclk);
inv67$ invm1 (out[669:603], {3'b0, mul});
adder67b$ adder0 (x3mul, out[133:67], out[200:134], set0);
adder67b$ adder1 (x5mul, out[133:67], out[334:268], set0);
adder67b$ adder2 (x7mul, out[669:603], {mul, 3'b0}, set1);
endmodule
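The multiples formed in ppreg$ can be checked with a small Python sketch. It assumes that adder67b$ has ports (sum, a, b, carry_in) and that inv67$ is a bitwise inverter, since neither module body is shown. Under those assumptions, adder2 computes 7m as 8m + ~m + 1, i.e. 8m - m in two's complement, and 67 bits are enough to hold 7m for any 64-bit m because 7(2**64 - 1) < 2**67. The helper names are hypothetical.

```python
W = 67
MASK67 = (1 << W) - 1


def add67(a, b, carry_in):
    """Hypothetical stand-in for adder67b$: a + b + carry_in, truncated to 67 bits."""
    return (a + b + carry_in) & MASK67


def ppreg_multiples(mul):
    """Multiples 1..7 of a 64-bit multiplicand as formed in ppreg$."""
    if not 0 <= mul < (1 << 64):
        raise ValueError("mul must be a 64-bit unsigned value")
    m1 = mul                       # {3'b0, mul}
    m2 = (mul << 1) & MASK67       # {2'b0, mul, 1'b0}
    m4 = (mul << 2) & MASK67       # {1'b0, mul, 2'b0}
    m8 = (mul << 3) & MASK67       # {mul, 3'b0}
    not_m1 = ~m1 & MASK67          # inv67$ invm1: bitwise inverse of {3'b0, mul}
    x3 = add67(m1, m2, 0)          # adder0, carry-in set0
    x5 = add67(m1, m4, 0)          # adder1, carry-in set0
    x7 = add67(not_m1, m8, 1)      # adder2: ~m + 8m + 1 = 8m - m
    x6 = (x3 << 1) & MASK67        # lat6: {x3mul[65:0], 1'b0}
    return {1: m1, 2: m2, 3: x3, 4: m4, 5: x5, 6: x6, 7: x7}
```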
module mplier_shft16b$ (out, in, pen_a, iclk);
input [15:0] in;
input iclk;
input pen_a;
output [17:0] out;
wire set1=1'b1;
wire set0=1'b0;
pareg1b$ pa16 (out[17], in[15], set0, pen_a, set1, iclk);
pareg1b$ pa15 (out[16], in[14], set0, pen_a, set1, iclk);
pareg1b$ pa14 (out[15], in[13], set0, pen_a, set1, iclk);
pareg1b$ pa13 (out[14], in[12], out[17], pen_a, set1, iclk);
pareg1b$ pa12 (out[13], in[11], out[16], pen_a, set1, iclk);
pareg1b$ pa11 (out[12], in[10], out[15], pen_a, set1, iclk);
pareg1b$ pa10 (out[11], in[9], out[14], pen_a, set1, iclk);
pareg1b$ pa9 (out[10], in[8], out[13], pen_a, set1, iclk);
pareg1b$ pa8 (out[9], in[7], out[12], pen_a, set1, iclk);
pareg1b$ pa7 (out[8], in[6], out[11], pen_a, set1, iclk);
pareg1b$ pa6 (out[7], in[5], out[10], pen_a, set1, iclk);
pareg1b$ pa5 (out[6], in[4], out[9], pen_a, set1, iclk);
pareg1b$ pa4 (out[5], in[3], out[8], pen_a, set1, iclk);
pareg1b$ pa3 (out[4], in[2], out[7], pen_a, set1, iclk);
pareg1b$ pa2 (out[3], in[1], out[6], pen_a, set1, iclk);
pareg1b$ pa1 (out[2], in[0], out[5], pen_a, set1, iclk);
pareg1b$ pa0 (out[1], set0, out[4], pen_a, set1, iclk);
pareg1b$ pad5 (out[0], set0, out[3], pen_a, set1, iclk);
endmodule
module csa_5op$ (cout, prod, v, w, x, y, z);
input [75:0] v;
input [75:0] w;
input [75:0] x;
input [75:0] y;
input [75:0] z;
output [75:0] prod;
output cout;
wire [75:0] aop, bop;
ha$ ha0 (prod[0], c1_1, v[0], w[0]);
ha$ ha1 (s1_1, c1_2, v[1], w[1]);
ha$ ha2 (s1_2, c1_3, v[2], w[2]);
fa$ fa0 (s1_3, c1_4, v[3], w[3], x[3]);
fa$ fa1 (s1_4, c1_5, v[4], w[4], x[4]);
fa$ fa2 (s1_5, c1_6, v[5], w[5], x[5]);
fa$ fa3 (s1_6, c1_7, v[6], w[6], x[6]);
fa$ fa4 (s1_7, c1_8, v[7], w[7], x[7]);
fa$ fa5 (s1_8, c1_9, v[8], w[8], x[8]);
fa$ fa6 (s1_9, c1_10, v[9], w[9], x[9]);
fa$ fa7 (s1_10, c1_11, v[10], w[10], x[10]);
fa$ fa8 (s1_11, c1_12, v[11], w[11], x[11]);
fa$ fa9 (s1_12, c1_13, v[12], w[12], x[12]);
fa$ fa10 (s1_13, c1_14, v[13], w[13], x[13]);
fa$ fa11 (s1_14, c1_15, v[14], w[14], x[14]);
fa$ fa12 (s1_15, c1_16, v[15], w[15], x[15]);
fa$ fa13 (s1_16, c1_17, v[16], w[16], x[16]);
fa$ fa14 (s1_17, c1_18, v[17], w[17], x[17]);
fa$ fa15 (s1_18, c1_19, v[18], w[18], x[18]);
fa$ fa16 (s1_19, c1_20, v[19], w[19], x[19]);
fa$ fa20 (s1_20, c1_21, v[20], w[20], x[20]);
fa$ fa21 (s1_21, c1_22, v[21], w[21], x[21]);
fa$ fa22 (s1_22, c1_23, v[22], w[22], x[22]);
fa$ fa23 (s1_23, c1_24, v[23], w[23], x[23]);
fa$ fa24 (s1_24, c1_25, v[24], w[24], x[24]);
fa$ fa25 (s1_25, c1_26, v[25], w[25], x[25]);
fa$ fa26 (s1_26, c1_27, v[26], w[26], x[26]);
fa$ fa27 (s1_27, c1_28, v[27], w[27], x[27]);
fa$ fa28 (s1_28, c1_29, v[28], w[28], x[28]);
fa$ fa29 (s1_29, c1_30, v[29], w[29], x[29]);
fa$ fa30 (s1_30, c1_31, v[30], w[30], x[30]);
fa$ fa31 (s1_31, c1_32, v[31], w[31], x[31]);
fa$ fa32 (s1_32, c1_33, v[32], w[32], x[32]);
fa$ fa33 (s1_33, c1_34, v[33], w[33], x[33]);
fa$ fa34 (s1_34, c1_35, v[34], w[34], x[34]);
fa$ fa35 (s1_35, c1_36, v[35], w[35], x[35]);
fa$ fa36 (s1_36, c1_37, v[36], w[36], x[36]);
fa$ fa37 (s1_37, c1_38, v[37], w[37], x[37]);
fa$ fa38 (s1_38, c1_39, v[38], w[38], x[38]);
fa$ fa39 (s1_39, c1_40, v[39], w[39], x[39]);
fa$ fa40 (s1_40, c1_41, v[40], w[40], x[40]);
fa$ fa41 (s1_41, c1_42, v[41], w[41], x[41]);
fa$ fa42 (s1_42, c1_43, v[42], w[42], x[42]);
fa$ fa43 (s1_43, c1_44, v[43], w[43], x[43]);
fa$ fa44 (s1_44, c1_45, v[44], w[44], x[44]);
fa$ fa45 (s1_45, c1_46, v[45], w[45], x[45]);
fa$ fa46 (s1_46, c1_47, v[46], w[46], x[46]);
fa$ fa47 (s1_47, c1_48, v[47], w[47], x[47]);
fa$ fa48 (s1_48, c1_49, v[48], w[48], x[48]);
fa$ fa49 (s1_49, c1_50, v[49], w[49], x[49]);
fa$ fa50 (s1_50, c1_51, v[50], w[50], x[50]);
fa$ fa51 (s1_51, c1_52, v[51], w[51], x[51]);
fa$ fa52 (s1_52, c1_53, v[52], w[52], x[52]);
fa$ fa53 (s1_53, c1_54, v[53], w[53], x[53]);
fa$ fa54 (s1_54, c1_55, v[54], w[54], x[54]);
fa$ fa55 (s1_55, c1_56, v[55], w[55], x[55]);
fa$ fa56 (s1_56, c1_57, v[56], w[56], x[56]);
fa$ fa57 (s1_57, c1_58, v[57], w[57], x[57]);
fa$ fa58 (s1_58, c1_59, v[58], w[58], x[58]);
fa$ fa59 (s1_59, c1_60, v[59], w[59], x[59]);
fa$ fa60 (s1_60, c1_61, v[60], w[60], x[60]);
fa$ fa61 (s1_61, c1_62, v[61], w[61], x[61]);
fa$ fa62 (s1_62, c1_63, v[62], w[62], x[62]);
fa$ fa63 (s1_63, c1_64, v[63], w[63], x[63]);
fa$ fa64 (s1_64, c1_65, v[64], w[64], x[64]);
fa$ fa65 (s1_65, c1_66, v[65], w[65], x[65]);
fa$ fa66 (s1_66, c1_67, v[66], w[66], x[66]);
ha$ ha1_67 (s1_67, c1_68, w[67], x[67]);
ha$ ha1_68 (s1_68, c1_69, w[68], x[68]);
ha$ ha1_69 (s1_69, c1_70, w[69], x[69]);
ha$ ha1_70 (s1_70, c1_71, w[70], x[70]);
ha$ ha1_71 (s1_71, c1_72, w[71], x[71]);
ha$ ha1_72 (s1_72, c1_73, w[72], x[72]);
ha$ ha1_73 (s1_73, c1_74, w[73], x[73]);
ha$ ha1_74 (s1_74, c1_75, w[74], x[74]);
ha$ ha4 (s2_2, bop[3], s1_2, c1_2);
ha$ ha5 (aop[3], bop[4], s1_3, c1_3);
ha$ ha6 (aop[4], bop[5], s1_4, c1_4);
ha$ ha7 (aop[5], bop[6], s1_5, c1_5);
fa$ fa2_1 (aop[6], bop[7], y[6], s1_6, c1_6);
fa$ fa2_2 (aop[7], bop[8], y[7], s1_7, c1_7);
fa$ fa2_3 (aop[8], bop[9], y[8], s1_8, c1_8);
fa$ fa2_4 (s2_9, c2_10, y[9], s1_9, c1_9);
fa$ fa2_10 (s2_10, c2_11, y[10], s1_10, c1_10);
fa$ fa2_11 (s2_11, c2_12, y[11], s1_11, c1_11);
fa$ fa2_12 (s2_12, c2_13, y[12], s1_12, c1_12);
fa$ fa2_13 (s2_13, c2_14, y[13], s1_13, c1_13);
fa$ fa2_14 (s2_14, c2_15, y[14], s1_14, c1_14);
fa$ fa2_15 (s2_15, c2_16, y[15], s1_15, c1_15);
fa$ fa2_16 (s2_16, c2_17, y[16], s1_16, c1_16);
fa$ fa2_17 (s2_17, c2_18, y[17], s1_17, c1_17);
fa$ fa2_18 (s2_18, c2_19, y[18], s1_18, c1_18);
fa$ fa2_19 (s2_19, c2_20, y[19], s1_19, c1_19);
fa$ fa2_20 (s2_20, c2_21, y[20], s1_20, c1_20);
fa$ fa2_21 (s2_21, c2_22, y[21], s1_21, c1_21);
fa$ fa2_22 (s2_22, c2_23, y[22], s1_22, c1_22);
fa$ fa2_23 (s2_23, c2_24, y[23], s1_23, c1_23);
fa$ fa2_24 (s2_24, c2_25, y[24], s1_24, c1_24);
fa$ fa2_25 (s2_25, c2_26, y[25], s1_25, c1_25);
fa$ fa2_26 (s2_26, c2_27, y[26], s1_26, c1_26);
fa$ fa2_27 (s2_27, c2_28, y[27], s1_27, c1_27);
fa$ fa2_28 (s2_28, c2_29, y[28], s1_28, c1_28);
fa$ fa2_29 (s2_29, c2_30, y[29], s1_29, c1_29);
fa$ fa2_30 (s2_30, c2_31, y[30], s1_30, c1_30);
fa$ fa2_31 (s2_31, c2_32, y[31], s1_31, c1_31);
fa$ fa2_32 (s2_32, c2_33, y[32], s1_32, c1_32);
fa$ fa2_33 (s2_33, c2_34, y[33], s1_33, c1_33);
fa$ fa2_34 (s2_34, c2_35, y[34], s1_34, c1_34);
fa$ fa2_35 (s2_35, c2_36, y[35], s1_35, c1_35);
fa$ fa2_36 (s2_36, c2_37, y[36], s1_36, c1_36);
fa$ fa2_37 (s2_37, c2_38, y[37], s1_37, c1_37);
fa$ fa2_38 (s2_38, c2_39, y[38], s1_38, c1_38);
fa$ fa2_39 (s2_39, c2_40, y[39], s1_39, c1_39);
fa$ fa2_40 (s2_40, c2_41, y[40], s1_40, c1_40);
fa$ fa2_41 (s2_41, c2_42, y[41], s1_41, c1_41);
fa$ fa2_42 (s2_42, c2_43, y[42], s1_42, c1_42);
fa$ fa2_43 (s2_43, c2_44, y[43], s1_43, c1_43);
fa$ fa2_44 (s2_44, c2_45, y[44], s1_44, c1_44);
fa$ fa2_45 (s2_45, c2_46, y[45], s1_45, c1_45);
fa$ fa2_46 (s2_46, c2_47, y[46], s1_46, c1_46);
fa$ fa2_47 (s2_47, c2_48, y[47], s1_47, c1_47);
fa$ fa2_48 (s2_48, c2_49, y[48], s1_48, c1_48);
fa$ fa2_49 (s2_49, c2_50, y[49], s1_49, c1_49);
fa$ fa2_50 (s2_50, c2_51, y[50], s1_50, c1_50);
fa$ fa2_51 (s2_51, c2_52, y[51], s1_51, c1_51);
fa$ fa2_52 (s2_52, c2_53, y[52], s1_52, c1_52);
fa$ fa2_53 (s2_53, c2_54, y[53], s1_53, c1_53);
fa$ fa2_54 (s2_54, c2_55, y[54], s1_54, c1_54);
fa$ fa2_55 (s2_55, c2_56, y[55], s1_55, c1_55);
fa$ fa2_56 (s2_56, c2_57, y[56], s1_56, c1_56);
fa$ fa2_57 (s2_57, c2_58, y[57], s1_57, c1_57);
fa$ fa2_58 (s2_58, c2_59, y[58], s1_58, c1_58);
fa$ fa2_59 (s2_59, c2_60, y[59], s1_59, c1_59);
fa$ fa2_60 (s2_60, c2_61, y[60], s1_60, c1_60);
fa$ fa2_61 (s2_61, c2_62, y[61], s1_61, c1_61);
fa$ fa2_62 (s2_62, c2_63, y[62], s1_62, c1_62);
fa$ fa2_63 (s2_63, c2_64, y[63], s1_63, c1_63);
fa$ fa2_64 (s2_64, c2_65, y[64], s1_64, c1_64);
fa$ fa2_65 (s2_65, c2_66, y[65], s1_65, c1_65);
fa$ fa2_66 (s2_66, c2_67, y[66], s1_66, c1_66);
fa$ fa2_67 (s2_67, c2_68, y[67], s1_67, c1_67);
fa$ fa2_68 (s2_68, c2_69, y[68], s1_68, c1_68);
fa$ fa2_69 (s2_69, c2_70, y[69], s1_69, c1_69);
fa$ fa2_70 (s2_70, c2_71, y[70], s1_70, c1_70);
fa$ fa2_71 (s2_71, c2_72, y[71], s1_71, c1_71);
fa$ fa2_72 (s2_72, c2_73, y[72], s1_72, c1_72);
fa$ fa2_73 (s2_73, c2_74, y[73], s1_73, c1_73);
fa$ fa2_74 (s2_74, c2_75, y[74], s1_74, c1_74);
ha$ ha10 (prod[2], cin, s2_2, c2_2);
ha$ ha11 (aop[9], bop[10], z[9], s2_9);
fa$ fa3_10 (aop[10], bop[11], z[10], s2_10, c2_10);
fa$ fa3_11 (aop[11], bop[12], z[11], s2_11, c2_11);
fa$ fa3_12 (aop[12], bop[13], z[12], s2_12, c2_12);
fa$ fa3_13 (aop[13], bop[14], z[13], s2_13, c2_13);
fa$ fa3_14 (aop[14], bop[15], z[14], s2_14, c2_14);
fa$ fa3_15 (aop[15], bop[16], z[15], s2_15, c2_15);
fa$ fa3_16 (aop[16], bop[17], z[16], s2_16, c2_16);
fa$ fa3_17 (aop[17], bop[18], z[17], s2_17, c2_17);
fa$ fa3_18 (aop[18], bop[19], z[18], s2_18, c2_18);
fa$ fa3_19 (aop[19], bop[20], z[19], s2_19, c2_19);
fa$ fa3_20 (aop[20], bop[21], z[20], s2_20, c2_20);
fa$ fa3_21 (aop[21], bop[22], z[21], s2_21, c2_21);
fa$ fa3_22 (aop[22], bop[23], z[22], s2_22, c2_22);
fa$ fa3_23 (aop[23], bop[24], z[23], s2_23, c2_23);
fa$ fa3_24 (aop[24], bop[25], z[24], s2_24, c2_24);
fa$ fa3_25 (aop[25], bop[26], z[25], s2_25, c2_25);
fa$ fa3_26 (aop[26], bop[27], z[26], s2_26, c2_26);
fa$ fa3_27 (aop[27], bop[28], z[27], s2_27, c2_27);
fa$ fa3_28 (aop[28], bop[29], z[28], s2_28, c2_28);
fa$ fa3_29 (aop[29], bop[30], z[29], s2_29, c2_29);
fa$ fa3_30 (aop[30], bop[31], z[30], s2_30, c2_30);
fa$ fa3_31 (aop[31], bop[32], z[31], s2_31, c2_31);
fa$ fa3_32 (aop[32], bop[33], z[32], s2_32, c2_32);
fa$ fa3_33 (aop[33], bop[34], z[33], s2_33, c2_33);
fa$ fa3_34 (aop[34], bop[35], z[34], s2_34, c2_34);
fa$ fa3_35 (aop[35], bop[36], z[35], s2_35, c2_35);
fa$ fa3_36 (aop[36], bop[37], z[36], s2_36, c2_36);
fa$ fa3_37 (aop[37], bop[38], z[37], s2_37, c2_37);
fa$ fa3_38 (aop[38], bop[39], z[38], s2_38, c2_38);
fa$ fa3_39 (aop[39], bop[40], z[39], s2_39, c2_39);
fa$ fa3_40 (aop[40], bop[41], z[40], s2_40, c2_40);
fa$ fa3_41 (aop[41], bop[42], z[41], s2_41, c2_41);
fa$ fa3_42 (aop[42], bop[43], z[42], s2_42, c2_42);
fa$ fa3_43 (aop[43], bop[44], z[43], s2_43, c2_43);
fa$ fa3_44 (aop[44], bop[45], z[44], s2_44, c2_44);
fa$ fa3_45 (aop[45], bop[46], z[45], s2_45, c2_45);
fa$ fa3_46 (aop[46], bop[47], z[46], s2_46, c2_46);
fa$ fa3_47 (aop[47], bop[48], z[47], s2_47, c2_47);
fa$ fa3_48 (aop[48], bop[49], z[48], s2_48, c2_48);
fa$ fa3_49 (aop[49], bop[50], z[49], s2_49, c2_49);
fa$ fa3_50 (aop[50], bop[51], z[50], s2_50, c2_50);
fa$ fa3_51 (aop[51], bop[52], z[51], s2_51, c2_51);
fa$ fa3_52 (aop[52], bop[53], z[52], s2_52, c2_52);
fa$ fa3_53 (aop[53], bop[54], z[53], s2_53, c2_53);
fa$ fa3_54 (aop[54], bop[55], z[54], s2_54, c2_54);
fa$ fa3_55 (aop[55], bop[56], z[55], s2_55, c2_55);
fa$ fa3_56 (aop[56], bop[57], z[56], s2_56, c2_56);
fa$ fa3_57 (aop[57], bop[58], z[57], s2_57, c2_57);
fa$ fa3_58 (aop[58], bop[59], z[58], s2_58, c2_58);
fa$ fa3_59 (aop[59], bop[60], z[59], s2_59, c2_59);
fa$ fa3_60 (aop[60], bop[61], z[60], s2_60, c2_60);
fa$ fa3_61 (aop[61], bop[62], z[61], s2_61, c2_61);
fa$ fa3_62 (aop[62], bop[63], z[62], s2_62, c2_62);
fa$ fa3_63 (aop[63], bop[64], z[63], s2_63, c2_63);
fa$ fa3_64 (aop[64], bop[65], z[64], s2_64, c2_64);
fa$ fa3_65 (aop[65], bop[66], z[65], s2_65, c2_65);
fa$ fa3_66 (aop[66], bop[67], z[66], s2_66, c2_66);
fa$ fa3_67 (aop[67], bop[68], z[67], s2_67, c2_67);
fa$ fa3_68 (aop[68], bop[69], z[68], s2_68, c2_68);
fa$ fa3_69 (aop[69], bop[70], z[69], s2_69, c2_69);
fa$ fa3_70 (aop[70], bop[71], z[70], s2_70, c2_70);
fa$ fa3_71 (aop[71], bop[72], z[71], s2_71, c2_71);
fa$ fa3_72 (aop[72], bop[73], z[72], s2_72, c2_72);
fa$ fa3_73 (aop[73], bop[74], z[73], s2_73, c2_73);
fa$ fa3_74 (aop[74], bop[75], z[74], s2_74, c2_74);
module pareg1b$ (out, pin, sin, pen, minitb, clk);
input pin, sin, pen, minitb, clk;
output out;
wire muxout;
wire set1=1'b1;
mux2$ mux0 (muxout, sin, pin, pen);
samp_hold$ sh (out, muxout, set1, clk);
endmodule
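The idea behind csa_5op$ is carry-save addition: each row of full adders turns three operands into a sum word and a carry word with the same total, so five operands can be reduced to two in three rows before a single carry-propagate addition. The Python sketch below illustrates that technique only; it does not reproduce the staggered half adders and exact bit positions of csa_5op$, the 76-bit width is taken from its port declarations, and the function names are hypothetical.

```python
WIDTH = 76
MASK = (1 << WIDTH) - 1


def csa_row(a, b, c):
    """One row of bitwise full adders: returns (sum, carry), sum + carry == a + b + c."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s & MASK, carry & MASK


def csa_5op(v, w, x, y, z):
    """Reduce five operands to two with three CSA rows, then add once."""
    s1, c1 = csa_row(v, w, x)    # first row
    s2, c2 = csa_row(s1, c1, y)  # second row
    s3, c3 = csa_row(s2, c2, z)  # third row
    return (s3 + c3) & MASK      # final carry-propagate addition
```

Because no carry ripples inside a row, each row costs one full-adder delay regardless of width; only the last addition pays for carry propagation.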
module mul_radix8$ (paout, multiplier, multiplicand, minit, clk);
input clk;
input minit;
input [63:0] multiplier;
wire [75:0] prod, ppv; wire [669:0] ppout; wire [75:0] ppw, ppx, ppy, ppz; wire [66:0] ppw_temp, ppx_temp, ppy_temp, ppz_temp; wire pen_a; wire pen_prod; wire cin; wire set0=1b0; wire [127:0] product; wire [17:0] mplier_w, mplier_x, mplier_y, mplier_z; wire [15:0] mplier_w_in, mplier_x_in, mplier_y_in, mplier_z_in; wire cout;
  assign ppv = {11'b0, paout[140:76]};
  assign product = paout[139:12];
  assign cout = paout[140];
  // Distribute the 64 multiplier bits over four 16-bit streams: bits 0-3 go one
  // to each stream, then each of the five 12-bit slices of bits 4-63 contributes
  // one 3-bit group to each stream.
  assign mplier_w_in = {multiplier[54:52], multiplier[42:40], multiplier[30:28],
                        multiplier[18:16], multiplier[6:4], multiplier[0]};
  assign mplier_x_in = {multiplier[57:55], multiplier[45:43], multiplier[33:31],
                        multiplier[21:19], multiplier[9:7], multiplier[1]};
  assign mplier_y_in = {multiplier[60:58], multiplier[48:46], multiplier[36:34],
                        multiplier[24:22], multiplier[12:10], multiplier[2]};
  assign mplier_z_in = {multiplier[63:61], multiplier[51:49], multiplier[39:37],
                        multiplier[27:25], multiplier[15:13], multiplier[3]};
  seqcon_radix8$ seqcon (pen_a, pen_prod, iclk2, iclk3, iclk4, iclk5, minit, multiplier[3:1], clk);
  pareg128b$ pareg (paout, prod, multiplier, c76, pen_a, pen_prod, clk);
  ppreg$ ppreg (ppout, multiplicand, iclk2);
  mplier_shft16b$ mplier0 (mplier_w, mplier_w_in, pen_a, clk);
  mplier_shft16b$ mplier1 (mplier_x, mplier_x_in, pen_a, clk);
  mplier_shft16b$ mplier2 (mplier_y, mplier_y_in, pen_a, clk);
  mplier_shft16b$ mplier3 (mplier_z, mplier_z_in, pen_a, clk);

  // Each mux8_67$ picks one of eight 67-bit candidates (zero, or one of the seven
  // slices ppout[133:67] through ppout[535:469]), selected by the low three bits
  // of the corresponding multiplier stream.
  mux8_67$ mux0 (ppw_temp, 67'b0, ppout[133:67], ppout[200:134], ppout[267:201],
                 ppout[334:268], ppout[401:335], ppout[468:402], ppout[535:469], mplier_w[2:0]);
  mux8_67$ mux1 (ppx_temp, 67'b0, ppout[133:67], ppout[200:134], ppout[267:201],
                 ppout[334:268], ppout[401:335], ppout[468:402], ppout[535:469], mplier_x[2:0]);
  mux8_67$ mux2 (ppy_temp, 67'b0, ppout[133:67], ppout[200:134], ppout[267:201],
                 ppout[334:268], ppout[401:335], ppout[468:402], ppout[535:469], mplier_y[2:0]);
  mux8_67$ mux3 (ppz_temp, 67'b0, ppout[133:67], ppout[200:134], ppout[267:201],
                 ppout[334:268], ppout[401:335], ppout[468:402], ppout[535:469], mplier_z[2:0]);

  // iclk2 selects between two alignments of each 67-bit value within a 76-bit operand.
  mux2_75$ mux4 (ppw, {9'b0, ppw_temp}, {3'b0, ppw_temp, 6'b0}, iclk2);
  mux2_75$ mux5 (ppx, {6'b0, ppx_temp, 3'b0}, {2'b0, ppx_temp, 7'b0}, iclk2);
  mux2_75$ mux6 (ppy, {3'b0, ppy_temp, 6'b0}, {1'b0, ppy_temp, 8'b0}, iclk2);
  csa_5op$ csa0 (c76, prod, ppv, ppw, ppx, ppy, {ppz_temp, 9'b0});
endmodule
Bibliography
[1] Jonathan P. Bowen, He Jifeng, and Xu Qiwen. An animatable operational semantics of the Verilog Hardware Description Language. In John A. McDermid, Shaoying Liu, and Michael G. Hinchey, editors, Proc. ICFEM 2000: 3rd IEEE International Conference on Formal Engineering Methods, pages 199–207. IEEE Computer Society Press, 2000.
[2] Robert S. Boyer and J. Strother Moore. Program verification. Journal of Automated Reasoning, 1(1):17–23, 1985.
[3] R. E. Bryant. On the complexity of VLSI implementations and graph representations of Boolean functions with application to integer multiplication. IEEE Transactions on Computers, 40(2):205–213, 1991.
[4] R. E. Bryant and Yirng-An Chen. Verification of arithmetic circuits with binary moment diagrams. In Design Automation Conference, pages 535–541, 1995.
[5] J. R. Burch. Using BDDs to verify multipliers. In Proceedings of the 28th ACM/IEEE Design Automation Conference, pages 408–412. ACM Press, 1991.
[6] E. M. Clarke, M. Fujita, and X. Zhao. Hybrid decision diagrams. In Proceedings of the 1995 IEEE/ACM International Conference on Computer-Aided Design, pages 159–163, 1995.
[7] D. Cyrluk. Microprocessor Verification in PVS: A Methodology and Simple Example. Technical Report SRI-CSL-93-12, Menlo Park, CA, 1993.
[8] D. Kapur and M. Subramaniam. Mechanically verifying a family of multiplier circuits. In Rajeev Alur and Thomas A. Henzinger, editors, Proceedings of the Eighth International Conference on Computer Aided Verification (CAV), volume 1102, pages 135–146, New Brunswick, NJ, USA, 1996. Springer-Verlag.
[9] D. M. Russinoff. A Mechanically Checked Proof of IEEE Compliance of a Register-Transfer-Level Specification of the AMD-K7 Floating-Point Multiplication, Division, and Square Root Instructions. In LMS Journal of Computation and Mathematics, volume 1, pages 148–200, December 1998.
[10] Nachum Dershowitz. A taste of rewrite systems. In Functional Programming, Concurrency, Simulation and Automated Reasoning, pages 199–228, 1993.
[11] VHDL Synthesis Interoperability Working Group. IEEE P1076.6/D2.01 Draft Standard for VHDL Register Transfer Level Synthesis.
[12] H. Anderson, P. Williams, and H. Hulgaard. Equivalence Checking of Combinational Circuits using Boolean Expression Diagrams. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(7), 1999.
[13] J. Hoe and Arvind. Hardware synthesis from term rewriting systems. In X IFIP International Conference on VLSI (VLSI '99), Lisbon, Portugal, November 1999.
[14] W. A. Hunt. FM8501: A Verified Microprocessor. PhD thesis, University of Texas at Austin, 1985.
[15] Texas Instruments. SN74181, SN74LS181, SN74S181 Arithmetic Logic Units/Function Generators. Bulletin No. DL-S 7611831, December 1972.
[16] J. Field. A simple rewriting semantics for realistic imperative programs and its application to program analysis. In ACM SIGPLAN Workshop on Partial Evaluation and Semantics-Based Program Manipulation, pages 98–107, 1990.
[17] D. Kapur. Theorem proving support for hardware verification. In Third Intl. Workshop on First-Order Theorem Proving (FTP 2000), St. Andrews, Scotland, July 2000.
[18] D. Kapur and H. Zhang. An overview of Rewrite Rule Laboratory (RRL). J. Computer and Mathematics with Applications, 29(2):91–114, 1995.
[19] M. Kaufmann and J. Moore. ACL2: An industrial strength version of Nqthm. In COMPASS '96: Eleventh Annual Conference on Computer Assurance, page 23, Gaithersburg, Maryland, 1996. National Institute of Standards and Technology.
[20] M. Kaufmann. Personal Communication.
[21] J. Klop. Term Rewriting Systems. In S. Abramsky, D. M. Gabbay, and T. S. E. Maibaum, editors, Handbook of Logic in Computer Science, volume 2, pages 1–116. Oxford University Press, 1992.
[22] M. Kaufmann and D. Russinoff. Verification of Pipeline Circuits. In ACL2 Workshop 2000 (proceedings are available as UTCS Technical Report TR-0029), October 2000.
[23] Z. Manna, N. Bjorner, A. Browne, E. Y. Chang, M. Colon, L. de Alfaro, H. Devarajan, A. Kapur, J. Lee, H. Sipma, and T. E. Uribe. STeP: The Stanford Temporal Prover. In TAPSOFT, pages 793–794, 1995.
[24] IEEE 1364-2001 Standard Verilog Language Reference Manual.
[25] Y. Matsunaga. An Efficient Equivalence Checker for Combinational Circuits. In Proceedings of Design Automation Conference, pages 629–634, 1996.
[26] J. Sawada and W. A. Hunt. Processor verification with precise exceptions and speculative execution. In Proc. 10th International Computer Aided Verification Conference, pages 135–146, 1998.
[27] X. Shen. Design and Verification of Speculative Processors. In Proceedings of the Workshop on Formal Techniques for Hardware and Hardware-like Systems, Marstrand, Sweden, June 1998.
[28] J. Strother Moore. Personal Communication.
[29] Li Yongjian and He Jifeng. Towards a theory of bisimulation for a fragment of Verilog. In International Parallel and Distributed Processing Symposium (IPDPS '03), pages 22–26, April 2003.
[30] H. Yu and J. A. Abraham. An Efficient 3-bit-scan Multiplier without Overlapping Bits, and its 64x64 Bit Implementation. In Proceedings of 7th Asia and South Pacific Design Automation Conference, January 2002.
[31] Z. Zhou, X. Song, F. Corella, E. Cerny, and M. Langevin. Description and verification of RTL designs using multiway decision graphs. In Proceedings of the Conference on Hardware Description Languages, 1995.
[32] Zhu Huibiao, Jonathan P. Bowen, and He Jifeng. Soundness, completeness and non-redundancy of operational semantics for Verilog based on denotational semantics. In Chris George and Huaikou Miao, editors, Formal Methods and Software Engineering, ICFEM 2002: 4th International Conference on Formal Engineering Methods, volume 2495 of Lecture Notes in Computer Science, pages 600–612. Springer-Verlag, 21–25 October 2002. Extended version to be available as Technical Report SBU-CISM-02-07, SCISM, South Bank University, London, UK, 2002.
Vita
Shobha Vasudevan received her Bachelor's degree in Computer Engineering from the University of Mumbai, India. She is currently a PhD student working with Dr. Jacob Abraham. Her interests include formal verification of RT-Level designs, verification of C-level specifications, and software verification techniques and their application to hardware.
LaTeX is a document preparation system developed by Leslie Lamport as a special version of Donald Knuth's TeX Program.