12a Timing Optimization

Logic Restructuring for Timing Optimization
Outline:
Definitions and problem statement
Overview of techniques (motivated by adders)
Tree height reduction (THR)

Generalized bypass transform (GBX)
Generalized select transform (GST)
Partial collapsing (?)
Timing Optimization
Factors determining delay of circuit:
Underlying circuit technology
Circuit type (e.g. domino, static CMOS, etc.)

Gate type
Gate size
Logical structure of circuit
Length of computation paths

False paths
Buffering
Parasitics
Wire loads
Layout
Problem Statement
Given:
Initial circuit function description
Library of primitive functions
Performance constraints (arrival/required times)
Generate:
an implementation of the circuit using the primitive functions, such that:
1. performance constraints are met
2. circuit area is minimized
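The arrival/required-time constraints above can be turned into slacks with one forward and one backward pass over the network. The following is a minimal sketch under a unit-delay assumption; the netlist format and the names `slacks`, `fanins`, `arrival_in` and `required_out` are illustrative and do not come from the lecture.

```python
def slacks(fanins, arrival_in, required_out, gate_delay=1):
    """Return (arrival, required, slack) for a DAG listed in topological order.

    fanins:       dict gate -> list of fanin names (hypothetical netlist format)
    arrival_in:   dict primary input -> arrival time
    required_out: dict primary output -> required time
    """
    arrival = dict(arrival_in)
    for g, ins in fanins.items():                 # forward pass: arrival times
        arrival[g] = max(arrival[i] for i in ins) + gate_delay

    required = {n: float("inf") for n in arrival}
    required.update(required_out)
    for g in reversed(list(fanins)):              # backward pass: required times
        for i in fanins[g]:
            required[i] = min(required[i], required[g] - gate_delay)

    slack = {n: required[n] - arrival[n] for n in arrival}
    return arrival, required, slack


# Hypothetical example: out = (a op b) op c, with input c arriving late.
fanins = {"g1": ["a", "b"], "out": ["g1", "c"]}
arrival, required, slack = slacks(fanins, {"a": 0, "b": 0, "c": 2}, {"out": 2})
print(slack)
```

Nodes with negative slack (here `c` and `out`) lie on a path that violates the performance constraint; these are the candidates for restructuring.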
Current Design Process

Behavioral description
Logic and latches

Logic equations
Gate library
Perf. Constraints
Delay models
Gate netlist
Layout
Behavior optimization (scheduling)
Partitioning (retiming)
Logic synthesis
Technology independent
Technology mapping
Timing driven place and route
Technology mapping for delay
[Figure: function tree and buffer tree]
Overview of Solutions for delay

1. Circuit re-structuring
Rescheduling operations to reduce time of computation
2. Implementation of function trees (technology mapping)
Selection of gates from library

Minimum delay (load independent model - Kukimoto)
Minimize delay and area (Jongeneel, DAC00)
(combines Lehman-Watanabe and Kukimoto)
3. Implementation of buffer trees
Touati (LT-trees)
Singh
4. Resizing
Focus here on circuit re-structuring
Circuit re-structuring
Approaches:
Local:
Mimic optimization techniques in adders
Carry lookahead (THR tree height reduction)

Conditional sum (GST transformation)
Carry bypass (GBX transformation)
Global:
Reduce depth of entire circuit
Partial collapsing
Boolean simplification
Re-structuring methods
Performance measured by
1. levels,
2. sensitizable paths,
3. technology dependent delays
Level based optimizations:
Tree height reduction (Singh 88)
Partial collapsing and simplification (Touati 91)
Generalized select transform (Berman 90)
Sensitizable paths:
Generalized bypass transform (McGeer 91)
Re-structuring for delay: tree-height reduction
[Figure: network with inputs b, c, d, e, f and internal nodes h to n, annotated with arrival times. The critical region is collapsed and some logic is duplicated.]
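The idea behind tree-height reduction can be sketched for a single associative operator (for example a chain of 2-input ANDs): always combine the two earliest-arriving signals, so that late signals end up close to the output. This is only an illustration of the principle with unit gate delays, not the algorithm of Singh 88, which works on a general network region and duplicates logic. The arrival times below are hypothetical.

```python
import heapq


def chain_delay(arrivals, gate_delay=1):
    """Output arrival time of a left-to-right chain of 2-input gates."""
    t = arrivals[0]
    for a in arrivals[1:]:
        t = max(t, a) + gate_delay
    return t


def balanced_delay(arrivals, gate_delay=1):
    """Greedy restructuring: repeatedly combine the two earliest signals."""
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, max(a, b) + gate_delay)
    return heap[0]


arrivals = [2, 0, 0, 0]   # hypothetical: the first input arrives late
print(chain_delay(arrivals), balanced_delay(arrivals))
```

With the late input at the head of the chain, every gate waits for it; after restructuring it passes through a single gate.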
Restructuring for delay: path reduction (Singh 88)
[Figure: the same network after collapsing the critical region and duplicating logic, then after path reduction. New delay = 5.]
Generalized bypass transform (GBX)
Make critical path false
Speed up the circuit
Bypass logic of critical path(s)

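[Figure (McGeer 91): the path fm = f, fm+1, ..., fn = g before and after the transform. After the transform a multiplexor with data inputs 0 and 1 produces g; its control input is the Boolean difference dg/df. The s-a-0 fault is redundant.]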
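The bypass idea can be checked on the adder case that motivates it. In the sketch below, f is the carry-in of a ripple chain and g is its carry-out; the Boolean difference dg/df is computed from the two cofactors, and a multiplexor controlled by it selects either f directly or the original logic. The function names are illustrative, and returning f unchanged on the bypass relies on the carry-out being positive unate in the carry-in, which holds for this example but is an assumption for other functions.

```python
from itertools import product


def ripple_carry_out(cin, a, b):
    """Carry-out of a ripple chain; a and b are tuples of bits, LSB first."""
    c = cin
    for ai, bi in zip(a, b):
        c = (ai & bi) | ((ai ^ bi) & c)
    return c


def boolean_difference(g, other_args):
    """dg/df for f = first argument of g: g(f=0) XOR g(f=1)."""
    return g(0, *other_args) ^ g(1, *other_args)


def bypassed_carry_out(cin, a, b):
    """GBX form: a multiplexor controlled by dg/df picks cin or the old logic."""
    sel = boolean_difference(ripple_carry_out, (a, b))
    return cin if sel else ripple_carry_out(cin, a, b)


n = 3
for cin, *bits in product((0, 1), repeat=1 + 2 * n):
    a, b = tuple(bits[:n]), tuple(bits[n:])
    # The transformed circuit computes the same function.
    assert bypassed_carry_out(cin, a, b) == ripple_carry_out(cin, a, b)
    # For the carry chain, dg/df is the AND of all propagate signals.
    all_propagate = int(all(x ^ y for x, y in zip(a, b)))
    assert boolean_difference(ripple_carry_out, (a, b)) == all_propagate
print("bypass form matches the ripple carry")
```

Whenever the multiplexor selects the original logic, dg/df = 0, so the long path from the carry-in cannot be sensitized: the critical path has been made false.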
GBX and KMS transform
GBX gives little area increase, BUT have now created an untestable
fault (on control input to multiplexor)
KMS transform: (remove false paths without increasing delay)
1. fk is the last node on the false path that fans out.
2. Duplicate the false path {f1, ..., fk} -> {f1', ..., fk'}
3. fj fans out to every fanout of fj except fj+1, and fj' just fans out to fj+1
4. Set the f0 input to f1' to the controlling value and propagate the constant (can do because the path is false and does not fan out)
KMS results
1. Function of every node, except f1', ..., fk', is unchanged
2. Added k-1 nodes
3. Area added is linear in the length of the false paths; in practice small area increase.
KMS (Keutzer, Malik, Saldanha 90)

[Figure: the path fm, fm+1, ..., fk, fk+1, ..., fn and its duplicated segment fm, fm+1, ..., fk. Delay is not increased.]
End of lecture 20
Generalized select transform (GST)
Late signal feeds multiplexor

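[Figure (Berman 90): out depends on a, b, c, where a arrives late. After the transform the logic is evaluated for a=0 and for a=1, and a multiplexor controlled by a selects out.]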
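The select transform is a Shannon expansion on the late signal: both cofactors are computed from the early inputs, and the late signal only drives the multiplexor. A minimal sketch follows; the function `h` is a made-up example and `select_transform` is an illustrative helper, neither is taken from Berman 90.

```python
from itertools import product


def h(a, b, c):
    """Hypothetical function whose first input a arrives late."""
    return ((a ^ b) & c) | (a & b)


def select_transform(func):
    """Rewrite func as a multiplexor on its first (late) input."""
    def selected(a, *rest):
        h0 = func(0, *rest)       # cofactor for a=0, does not wait for a
        h1 = func(1, *rest)       # cofactor for a=1, does not wait for a
        return h1 if a else h0    # multiplexor controlled by the late signal
    return selected


h_gst = select_transform(h)
for a, b, c in product((0, 1), repeat=3):
    assert h_gst(a, b, c) == h(a, b, c)
print("select form is equivalent")
```

The late signal now passes through the multiplexor only, at the cost of evaluating the logic twice.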
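GST vs GBX
[Figure: the same function h with late input a, implemented with GST (cofactors for a=0 and a=1 feeding a multiplexor controlled by a) and with GBX (multiplexor producing g, controlled by the Boolean difference dh/da).]
Note: Boolean difference $\partial h / \partial a = h_a \oplus h_{\bar{a}}$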
GST vs GBX
Select transform appears to be more area efficient
But Boolean difference generally more efficiently formed in practice
No delay/speedup advantage for either transform
Need one MUX per fanout in GST, only one MUX in GBX
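[Figure: GST applied to logic with late input a and two outputs out1 and out2; the cofactors for a=0 and a=1 feed one multiplexor per output.]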
Technology independent delay reductions
Generally THR, GBX, GST (critical path based methods) work OK, but not great
Why are technology independent delay reductions hard?
Lack of fast and accurate delay models
Delay models, each better but slower than the one before:
1. # levels, fast but crude

2. # levels + correction term (fanout, wires, ...): a little better, but still crude (what coefficients to use?)
3. Technology mapped: reasonable, but very slow
4. Place and route: better but extremely slow
5. Silicon: best, but infeasibly slow (except for FPGAs)
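The first two models can be written down in a few lines. The sketch below assumes a netlist given as a dict in topological order; the correction term is simply `alpha` times the fanout count of each gate, and the value of `alpha` is made up, which is exactly the "what coefficients to use?" problem.

```python
def level_delay(fanins):
    """Model 1: delay of a gate = its level (primary inputs are level 0)."""
    level = {}
    for g, ins in fanins.items():            # assumed topological order
        level[g] = 1 + max(level.get(i, 0) for i in ins)
    return level


def corrected_delay(fanins, alpha=0.5):
    """Model 2: unit gate delay plus alpha * fanout count (alpha is made up)."""
    fanout = {}
    for ins in fanins.values():
        for i in ins:
            fanout[i] = fanout.get(i, 0) + 1
    arrival = {}
    for g, ins in fanins.items():
        latest = max(arrival.get(i, 0.0) for i in ins)
        arrival[g] = latest + 1.0 + alpha * fanout.get(g, 0)
    return arrival


# Hypothetical netlist: g1 drives two gates, which reconverge at out.
fanins = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g1", "d"], "out": ["g2", "g3"]}
print(level_delay(fanins))
print(corrected_delay(fanins))
```

Both models see the same three levels, but the second one charges g1 for driving two gates.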
Clustering/partial-collapse
Traditional critical-path based methods require
Well defined critical path
Good delay/slack information
Problems:
Good delay information comes from mapper and layout

Delay estimates and models are weak
Possible solutions:
Better delay modeling at technology independent level

Make speedup insensitive to actual critical paths and mapped delays
Clustering/partial-collapse
Two-level circuits are fast
Collapse circuit to 2-level - but

Huge area penalty
Huge capacitive loading on inputs (can be much slower)
To avoid huge area penalty
Identify clusters of nodes

Each cluster has some fixed size
Perform collapse of each cluster
Simplify each node
Details
How to choose the clusters?

How to choose cluster size?
How to simplify each node?
Lawler's clustering algorithm

Optimal in delay:
For a given clustering size
May duplicate nodes (hence possible area penalty)

Not optimal w.r.t duplication
Use a heuristic
Fast: O(m x k)
m = number of edges in network

k = maximum cluster size
Clustering algorithm - overview

1. Label phase: (k is cluster size)
If node u is an input, label(u) := L := 0
Else L := max label of fanin of u
If (# nodes in TFI(u) with label = L) >= k, then label(u) := L+1, else label(u) := L
2. Cluster phase: (outputs to inputs)
If node u is an output, L := infinity
Else L := max label of fanouts of u
If (label(u) < L) then create a new cluster with root u and with members all the nodes in TFI(u) with label = label(u)
3. Collapse phase: (order independent)
Collapse all nodes in a cluster into a single node
Note: a node may be in several clusters (causes area increase)
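The label and cluster phases can be sketched directly from the steps above. This is a straightforward illustration, not the O(m x k) implementation: the transitive fanin is recomputed for every node. Two interpretation choices are assumptions of this sketch: TFI(u) excludes u itself, and only gates (not primary inputs) count towards the cluster size. The netlist format and all function names are hypothetical.

```python
from math import inf


def tfi(node, fanins):
    """Gates in the transitive fanin of node, excluding node (assumption)."""
    seen, stack = set(), list(fanins[node])
    while stack:
        n = stack.pop()
        if n in fanins and n not in seen:     # skip primary inputs
            seen.add(n)
            stack.extend(fanins[n])
    return seen


def label_phase(fanins, k):
    """fanins: dict gate -> fanin names, listed in topological order."""
    label = {}
    for u, ins in fanins.items():
        L = max(label.get(i, 0) for i in ins)          # inputs have label 0
        same = sum(1 for n in tfi(u, fanins) if label[n] == L)
        label[u] = L + 1 if same >= k else L
    return label


def cluster_phase(fanins, label, outputs):
    """Visit gates from outputs to inputs and return root -> member set."""
    fanouts = {u: [] for u in fanins}
    for u, ins in fanins.items():
        for i in ins:
            if i in fanouts:
                fanouts[i].append(u)
    clusters = {}
    for u in reversed(list(fanins)):
        if u in outputs:
            L = inf
        else:
            L = max((label[v] for v in fanouts[u]), default=-1)
        if label[u] < L:
            members = {n for n in tfi(u, fanins) if label[n] == label[u]}
            clusters[u] = members | {u}
    return clusters


# Hypothetical example: a chain of five 2-input gates, cluster size k = 2.
fanins = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g2", "d"],
          "g4": ["g3", "e"], "g5": ["g4", "f"]}
labels = label_phase(fanins, k=2)
clusters = cluster_phase(fanins, labels, outputs={"g5"})
print(labels)
print(clusters)
```

In this example the five-level chain is covered by three clusters of at most two gates each, so the collapsed network has three levels.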
Example of clustering
[Figure: example network with k=3 and node labels from 0 to 2, before and after clustering.]
Result: Lawler's algorithm gives minimum depth circuit
Typically,
1. we decompose initial circuit into 2-input NANDs and inverters.
2. then cluster size k reflects # 2-input NANDs to be collapsed together.
Choosing k
I(k): number of levels, given k

d(k): duplication ratio
Number of gates in cluster network divided by number of gates in original network
Determine k0 where k0/d(k0)~2.0

For every k from 2 to k0, compute d(k), I(k)
Use exhaustive enumeration: label and cluster (without collapse) for each k.
Each iteration is O(|E|k)
Choose k such that
I(k) is minimized
Break ties using d(k)
Minimize d(k)
[Figure: plot of d(k) and I(k) versus k, with k0 marked on the k axis.]
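The selection rule (minimize I(k), break ties with d(k)) is a lexicographic minimum over the enumerated values of k. A minimal sketch, assuming the label and cluster phases have already produced I(k) and d(k) for each k; the numbers in the table are invented for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    k: int               # cluster size
    levels: int          # I(k): number of levels after clustering
    duplication: float   # d(k): gates in clustered network / gates in original


def choose_k(candidates):
    """Minimize the number of levels, then break ties by duplication ratio."""
    return min(candidates, key=lambda c: (c.levels, c.duplication))


# Hypothetical results of label + cluster (without collapse) for k = 2 .. k0.
table = [
    Candidate(2, 9, 1.0),
    Candidate(3, 7, 1.1),
    Candidate(4, 5, 1.4),
    Candidate(5, 5, 1.7),
    Candidate(6, 5, 2.0),
]
best = choose_k(table)
print(best)
```

Here k = 4, 5 and 6 all reach the minimum number of levels, so the smallest duplication ratio decides.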
Area recovery
Area increase is due to node duplication -
this occurs when node is in multiple clusters
Two solutions:
1. Break clusters into smaller pieces off critical path
2. After cluster and collapse, recover area
Relabeling procedure:
Attempt to increase node labels without exceeding cluster size
In reverse topological order
Start: assign $\text{new-label}(O_i) = \max_{j \in PO} \text{label}(O_j)$
Increase label(u) if
1. new-label(u) <= label(v) for each fanout v and
2. new-label(u) = new-label(v) for each fanout v only if label(u) = label(v) before relabeling, and
3. no cluster size is violated
Relabeling example
[Figure: node labels (0 and 1) before and after relabeling.]
Post-collapse area recovery

Do algebraic factorization, but
Undo factorization if depth increases
Full_simplify
Only consider node v as possible fanin of a node (v introduced by using don't cares) if level of v < level of node.
Redundancy removal
Conclusions
Variety of methods for delay optimization
No single technique dominates (KJ Singh PhD thesis)
When applied to ripple-carry adder get
Carry-lookahead adder (THR)

Carry-bypass adder (GBX)
Carry-select adder (GST)
? (partial collapse)
All techniques ignore false paths when assessing the delay and critical regions
Can use KMS transform to eliminate false paths without increasing delay (area increase however).