JAZZ2 is the next-generation System-on-Chip from C2 Microsystems,
and the Media Processor (MP) is its core component. The Media
Processor Unit (MPU) is a 32-bit RISC superscalar unit equipped
with a 256-bit Vector Unit and a 64KB combined cache/buffer (CCB)
unit. The scalar unit can issue up to 4 instructions per cycle.
This paper focuses on the physical design of the Vector Unit (VU).
The VU is the component dedicated to vector calculation. It
consists of four main parts: the vector register file, the vector
bypass and byte-select crossbar, the vector multiplier, and the
vector ALU. Its register files hold 232 entries of 256 bits
(divided into 64 entries of 128 bits), and it has a total of 128
8×8 multipliers. It can perform 8-bit, 16-bit and 32-bit integer
multiplies, 32-bit floating-point multiplies, and 16-bit complex
multiplies. To improve the performance of the VU, reducing the
latency of the data path between pipeline stages is very
important, so custom layout is applied to the data path. Because
the custom layout is produced by a third-party tool, the main
challenges for us are how to integrate the custom placement
effectively with ICC (IC Compiler) and how to optimize congestion
and timing in ICC.
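As a rough illustration of the multiplier count, the sketch below works out how many elements a 256-bit vector holds at each integer width and how many 8×8 multipliers a full-vector multiply would then need. It assumes, purely for illustration, that a w-bit multiply is composed of (w/8)^2 8×8 partial products; the paper does not describe the actual organization of the multiplier array, and the function names are hypothetical.

```python
VECTOR_BITS = 256   # Vector Unit width stated in the text
BASE_BITS = 8       # operand width of one basic 8x8 multiplier


def lanes(element_bits: int) -> int:
    """Number of elements of the given width in one 256-bit vector."""
    return VECTOR_BITS // element_bits


def multipliers_needed(element_bits: int) -> int:
    """8x8 multipliers needed for one full-vector integer multiply.

    Assumption (not from the paper): a w-bit multiply is built from
    (w/8)^2 8x8 partial products.
    """
    per_lane = (element_bits // BASE_BITS) ** 2
    return lanes(element_bits) * per_lane


if __name__ == "__main__":
    for width in (8, 16, 32):
        print(width, lanes(width), multipliers_needed(width))
```

Under this assumption the 32-bit case is the most demanding one: 8 lanes of 16 partial products each, which is consistent with the total of 128 multipliers quoted above.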
This paper introduces our special design flow for the VU using
Synopsys tools. It also shares our experience of using IC Compiler
(Version ICC_vZ_2007.03-SP1) to optimize congestion and timing in
the VU physical design.
IC Compiler RP (Relative Placement) can perform the same custom
data path placement as the third-party tool we used for the VU
layout, and it is also introduced in this paper.
Finally, the paper covers timing analysis and the timing
correlation between ICC and PrimeTime SI.
2.0 Design Flow

Figure 1 shows the special design flow for the VU physical design.
It consists mainly of netlist generation, custom placement of the
data path, scan stitching, physical design, and timing analysis.
The steps in the bold text boxes are done by Synopsys tools and
the others are done by third-party tools.

Figure 1 Design Flow for VU


2.1 Netlist Generation

The VU module consists of data path and control logic. Because the
data path needs special attention during placement, the data path
netlist and its placement are produced by the third-party tool,
while the other logic is synthesized by Design Compiler. Finally,
the third-party tool reads all the netlists, connects them, and
generates the data path placement information. Chapter 3 describes
the custom placement by the third-party tool in detail.
2.2 Scan Chain Stitch
Generally, DFT Compiler stitches the scan registers according to
hierarchy and register name. Although IC Compiler can reorder scan
cells within one scan chain or between different scan chains, it
cannot produce exactly what we expect because of the random scan
stitching in DFT. In the VU physical design, the data path is
custom placed and the data buses are grouped together, so the
grouping can be specified explicitly in the normal DFT design
flow. This simplifies the flow and produces exactly the chains we
expect. In the VU DFT design, we group the bus registers (which
are custom placed) and stitch neighboring bus registers into one
scan chain. The registers that are synthesized by Design Compiler
and placed automatically by ICC are put together in one scan
chain.

Figure 2 Scan Chain without Group

Figure 2 shows the scan chains as displayed by IC Compiler (each
color is a single scan chain) when the chains are stitched
automatically by DFT Compiler. The scan cells are chained randomly
and the distance between two consecutive scan cells may be very
long, which dramatically affects congestion and timing in a 90nm
process. To solve this problem, the components of each scan chain
are specified manually with set_scan_group and set_scan_path in
DFT Compiler, so that the scan chains are built exactly as we
expect. The sample script for grouping the scan cells and setting
the scan paths is shown below:
set_scan_group d0vby -include_elements [list \
    mp_vu_dp/mp_vu_vr/vu_vbyp/vbyp_sd \
    mp_vu_dp/mp_vu_vr/vu_d0pipe]
set_scan_group vfa_va -include_elements [list \
    mp_vu_dp/mp_vu_vfa/vfax32x4 \
    mp_vu_dp/mp_vu_va]
set_scan_group w01 -include_elements [list \
    mp_vu_dp/mp_vu_vm/mul_array0/vm_w0 \
    mp_vu_dp/mp_vu_vm/mul_array0/w0pipe]

set_scan_path chain1 -include [list d0vby] -complete true
set_scan_path chain2 -include [list vfa_va] -complete true
set_scan_path chain3 -include [list w01] -complete true

Figure 3 shows the final VU scan chains in IC Compiler, with each
scan chain in a different color. The custom-placed bus registers
are chained together, which reduces the distance between scan
cells. As a result, the total wire length is reduced and timing
and congestion are dramatically improved.
A more efficient flow is now available: IC Compiler can read the
scan DEF written by DFT Compiler (DFTC) and then reorder the scan
chains based on the placement result.
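To make the effect of placement-aware stitching concrete, the sketch below compares the Manhattan wire length of a scan chain stitched in name order with one reordered by a simple greedy nearest-neighbor rule. This is only an illustration of the idea; it is not the algorithm used by IC Compiler or DFT Compiler, and the cell names and coordinates are hypothetical.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) placement locations."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def chain_length(cells, order):
    """Total scan wiring length when cells are stitched in the given order."""
    return sum(manhattan(cells[a], cells[b]) for a, b in zip(order, order[1:]))


def greedy_reorder(cells, start):
    """Reorder a chain by always hopping to the nearest unvisited cell."""
    remaining = set(cells) - {start}
    order = [start]
    while remaining:
        last = cells[order[-1]]
        # sorted() makes tie-breaking deterministic
        nxt = min(sorted(remaining), key=lambda n: manhattan(cells[n], last))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical scan cells: two physical clusters, interleaved by name.
CELLS = {
    "r0": (0, 0), "r1": (90, 40), "r2": (10, 0),
    "r3": (80, 40), "r4": (20, 0), "r5": (100, 40),
}

if __name__ == "__main__":
    name_order = sorted(CELLS)
    placed_order = greedy_reorder(CELLS, "r0")
    print(name_order, chain_length(CELLS, name_order))
    print(placed_order, chain_length(CELLS, placed_order))
```

In this toy case the name-ordered chain zigzags between the two clusters, while the reordered chain visits one cluster completely before crossing to the other once.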

Figure 3 Scan Chain with Group


2.3 SDC Generation
Because the top-level netlist is generated by the third-party
tool, the SDC file has to be generated by PrimeTime (PT). PT reads
all the constraints and the top-level netlist and writes out the
SDC file that supplies the design constraints to IC Compiler.
2.4 Physical Design by IC Compiler
Figure 4 shows the IC Compiler physical design flow. It consists
mainly of five parts: floorplanning, place_opt, clock_opt,
route_opt and chip finishing. Chapter 4 describes in detail how to
optimize congestion and timing with IC Compiler.

Figure 4 ICC Physical Design Flow


2.5 Extraction by Star-RCXT & STA by Primetime-SI
After physical design, parasitics such as resistance, capacitance,
and inductance must be extracted from the fully routed block in
order to analyze timing, noise, power and so on. The VU physical
design uses Star-RCXT from Synopsys. Its output, such as SPEF
(Standard Parasitic Exchange Format), is used by PrimeTime SI for
timing analysis, including SI analysis. Chapter 5 describes the
timing analysis in detail.
3.0 Custom Layout Design

To generate the custom layout of the data path logic and register
files in the VU, another third-party tool, dpGen, is used in our
design. Logic designers use dpGen to design datapath elements from
several built-in functions. A datapath element can be as simple as
a 32-bit 2-to-1 mux or as complex as a multiplier array that
supports 8-, 16- and 32-bit multiplications. dpGen uses a generic
standard cell library to generate the desired circuit, and it
produces a placement file together with a Verilog gate-level
netlist.
3.1 Datapath Elements
Figure 5 shows a simple diagram of the datapath flow in the VU,
with the different types of logic in the data path and the control
logic surrounding it. The register file (RF) and the pipeline bus
registers are highly structured, so they should be kept together
and placed along the data path direction. Some combinational logic
is closely tied to these registers, and placing it manually
greatly improves congestion and timing. All of this logic is
therefore generated and placed by the third-party tool, while the
remaining logic, shown as clouds in Figure 5, is synthesized by
Design Compiler. As Figure 5 shows, data flows from left to right
and control lines run from bottom to top.

Figure 5 VU Datapath Elements


3.2 Register File Example

The following is an example of a multi-port register file built
from standard cells. The core of this register file is already
very dense, but gaps were intentionally kept every 16 bits of the
RF. The last stage of the address decoders was pre-placed. The
decoder and other control logic were kept as RTL blocks and
synthesized. For the physical implementation in ICC, after
initializing the floorplan, the placement information in the DEF
file was read into ICC with the read_def command. Every cell
placed by the third-party tool had the FIXED property. The
majority of cells in this register file were already pre-placed.
ICC used the gaps between the cells to place the synthesized
control logic (red). In addition, all CTS (yellow) and IPO (green)
cells were placed in the available gaps.
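Before reading a custom placement into the place-and-route tool, it can be useful to confirm that the pre-placed cells really carry FIXED status in the DEF file. The sketch below scans simple one-line entries of a DEF COMPONENTS section and reports each instance's placement status. It is a minimal illustration that handles only entries of the form shown in the sample; the instance and cell names are hypothetical and are not taken from the VU design.

```python
import re

# Matches a simple DEF COMPONENTS entry such as:
#   - inst_name CELL_TYPE + FIXED ( 1000 2000 ) N ;
COMPONENT_RE = re.compile(
    r"^-\s+(?P<inst>\S+)\s+(?P<cell>\S+)\s+\+\s+"
    r"(?P<status>FIXED|PLACED|COVER)\s+"
    r"\(\s*(?P<x>-?\d+)\s+(?P<y>-?\d+)\s*\)\s+(?P<orient>\S+)"
)


def placement_status(def_text):
    """Map instance name -> placement status for simple one-line entries."""
    result = {}
    for line in def_text.splitlines():
        match = COMPONENT_RE.match(line.strip())
        if match:
            result[match.group("inst")] = match.group("status")
    return result


# Hypothetical DEF fragment: two pre-placed register bits, one free cell.
SAMPLE_DEF = """\
COMPONENTS 3 ;
- rf/bit0_reg DFFX1 + FIXED ( 1000 2000 ) N ;
- rf/bit1_reg DFFX1 + FIXED ( 1000 4000 ) FS ;
- ctl/u12 NAND2X1 + PLACED ( 5000 2000 ) N ;
END COMPONENTS
"""

if __name__ == "__main__":
    status = placement_status(SAMPLE_DEF)
    fixed = [name for name, s in status.items() if s == "FIXED"]
    print(f"{len(fixed)} of {len(status)} components are FIXED")
```

A check like this makes it easy to spot custom-placed cells that were accidentally written as PLACED and could therefore be moved during optimization.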

(I) pre-placement (II) post ICC placement, CTS and IPO


Figure 6 VU Register File Example
3.3 Combination Datapath and Control Logic
Some space is left in advance for the synthesized logic. As Figure
7 shows, ICC places the synthesized logic in the space where we
expect it. Figure 8 shows a similar example, in which the
synthesized logic is placed in a central location among the
pre-placed cells. The space between the pre-placed logic can be
increased or decreased slightly to achieve the best space
utilization, according to the cell density reported by ICC.

Figure 7 Synthesized Logic Placed in Empty Space

Figure 8 Random Logic Filling in the Gaps


3.4 Custom Design with ICC Relative Placement

In addition, IC Compiler has a physical datapath engine that
allows the user to specify the relative positioning of groups of
cells, just like the third-party tool we used. The initial
relative positioning of cells occurs during coarse placement. As
shown in Figure 9, the flow for using ICC relative placement
follows these major steps:
1) Read the gate-level netlist into IC Compiler by using the
import_design command.
2) Define the relative placement constraints:
   a) Create the relative placement groups by using the
   create_rp_group command.
   b) Add relative placement objects to the groups by using the
   add_to_rp_group command.
   IC Compiler annotates the netlist with the relative placement
   constraints.
3) Prevent relative placement cells from being removed during
optimization by using the set_size_only command.
4) Read the floorplan information by using read_def.
5) Perform placement for the design by using the place_opt
command.
6) Analyze the relative placement results.
7) Perform clock tree synthesis with the relative placement
structures fixed in place.
8) Perform routing with the relative placement structures fixed
in place.

Figure 9 Relative Placement Flow

Figure 10 Relative Placement Column and Row Positions

A relative placement group is an association of cells, other
groups, and keepouts. A group is defined by the number of rows and
columns it uses. Figure 10 shows the positions of columns and rows
in a relative placement group. These options are set when the
group is created with create_rp_group, and you can modify them
later by using the set_rp_group_options command.
Once an RP group is created, it can be used within another RP
group. This is done via the hierarchy switch on the
add_to_rp_group command. Figure 11 illustrates the creation and
use of hierarchical RP groups. Figure 12 shows relative placement
in a design containing obstructions, which is common in practice.
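The row/column and hierarchy concepts can be pictured with a small data model. The sketch below represents an RP group as a grid of (column, row) slots that hold either a cell name or another group, and walks the hierarchy to list every leaf cell. It is only a conceptual model written for this explanation: the class and method names are hypothetical, and it does not reproduce IC Compiler's commands or its placement behavior.

```python
from dataclasses import dataclass, field


@dataclass
class RPGroup:
    """Toy model of a relative placement group: a columns x rows grid."""
    name: str
    columns: int
    rows: int
    slots: dict = field(default_factory=dict)  # (column, row) -> cell or RPGroup

    def add(self, item, column, row):
        """Place a leaf cell name or a nested RPGroup at (column, row)."""
        if not (0 <= column < self.columns and 0 <= row < self.rows):
            raise ValueError(f"({column}, {row}) is outside group {self.name}")
        if (column, row) in self.slots:
            raise ValueError(f"({column}, {row}) is already occupied")
        self.slots[(column, row)] = item

    def leaves(self, path=()):
        """Yield (cell, path); path lists (group, column, row) from the top."""
        for column, row in sorted(self.slots):
            item = self.slots[(column, row)]
            step = path + ((self.name, column, row),)
            if isinstance(item, RPGroup):
                yield from item.leaves(step)   # descend into the nested group
            else:
                yield item, step


if __name__ == "__main__":
    bit_slice = RPGroup("slice", columns=2, rows=1)
    bit_slice.add("mux0", 0, 0)
    bit_slice.add("reg0", 1, 0)

    top = RPGroup("top", columns=1, rows=2)
    top.add(bit_slice, 0, 0)   # a group used inside another group
    top.add("buf", 0, 1)

    for cell, path in top.leaves():
        print(cell, path)
```

The nested group keeps its own internal column and row positions, which is the property that lets a bit slice be defined once and then positioned as a unit inside a larger structure.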

Figure 11 Including Groups in a Hierarchical Group

Figure 12 Relative Placement in a Design Containing Obstructions
4.0 Physical Design by IC Compiler
ICC is a powerful tool for automatic layout. Three key commands,
place_opt, clock_opt and route_opt, simplify the whole layout flow
and deliver high-quality optimization. As their names imply, they
are used in the placement, CTS and routing stages respectively.
This chapter describes the whole layout flow in detail.
4.1 Floorplan
As we know, the floorplan is a critical step in the whole layout
design. ICC provides a command for non-rectangular floorplans,
which makes this type of floorplan easy. The following is the
command:
The shape dimensions are as follows:

Figure 13 Initialize FloorPlan

The ICC command initialize_rectilinear_block defines four shape
styles that are often used in layout design: L, T, U and X. You
can also define an arbitrary shape yourself. For a normal
rectangular shape, you can use the command initialize_floorplan.
In our design, the custom layout for the data path is designed
separately in another tool and has to be brought into ICC, so
after initializing the floorplan, the DEF file written by the
custom layout tool is read into ICC:
The DEF file mp_vu_cutom.def includes only the following
information:
1) The position and orientation of the standard cells in the
custom layout
2) The port information: position, direction, metal layer, size
and so on
We use the DEF format rather than the TDF format to carry the port
information for the following reasons:
1) The TDF format does not support non-rectangular floorplans
well.
2) We need DEF to carry the custom layout information anyway, so
it is convenient to put the port information in the same DEF file.
In the next step, the synthesized standard cells are placed and
the power nets are routed:
Now the floorplan is complete, and we can use the following
commands to check the quality of the design:
Read the reports from the above three commands carefully. If in
any doubt, confirm with the circuit designer. The following is our
floorplan:

Figure 14 VU Floorplan in ICC

4.2 Placement
The main placement command in ICC is place_opt. We use the
following commands to run placement:
Because of timing and congestion issues, we use the options
-effort high, -congestion and -area_recovery with place_opt.
The -effort high option improves timing quality at the cost of a
longer runtime. The -congestion and -area_recovery options give a
better congestion result. Scan reordering is performed by
selecting the -optimize_dft option.
If the design still has congestion issues after the above
commands, you can use the following command:
The refine_placement command can further improve congestion, but
it makes the timing result worse. You can also use the following
command to improve timing:
Note that during timing optimization in ICC, RC should be
extracted before the optimization so that the timing information
is up to date. To update the RC information, extract_rc -estimate
is used before routing and extract_rc is used directly after
routing. During placement, groups (bounds) and placement blockages
are the usual methods for steering placement according to the
needs of the design, and ICC provides commands for both.
1) Group: create_bounds
The create_bounds command can generate two types of group: hard
and soft. The hard type requires the grouped elements to be placed
inside the group region only, while the soft type allows some
elements to be placed outside the group region. By default, other
cells cannot be placed in the group region; if you want to allow
this, you can choose the option -cycle_color to do so.

2) Placement blockage: create_placement_blockage
It also has two types: hard and soft. A hard placement blockage
strictly prevents cells from being placed in its region, while the
soft type allows some cells to be placed in the region. With the
above measures, timing improved by about 250ps during placement
and the congestion result also improved.
4.3 CTS
The quality of CTS directly affects the final timing results, and
ICC is powerful at building clock trees.
We describe CTS in detail below in the context of our design.
The above commands choose the buffer/inverter cells used to build
the clock tree. You can refer to the standard cell manual provided
by the foundry to choose suitable buffers/inverters.
During CTS, design rules such as max_transition can be specified
with the following command:
In our design, we set max_transition for clock signals to 200 ps.
For clock signals, we usually use a special routing rule to get
better quality. The following commands define the special routing
rule used for clock routing and make the clock tree use it.
The top metal of our design is METAL7 (METG1). We do not use the
metal layers below MET3 during CTS, and we make the spacing
between clock signals double the default spacing. To control the
CTS result, we usually give ICC a target skew value for CTS:

The target skew of our design is 300 ps. To solve the timing
issues in our design, we create useful skew during CTS to borrow
time from non-critical paths, which improves timing considerably.
To create useful skew, we need to choose suitable pins that have
non-critical paths before or after them. We first compile the
clock sub-trees from these pins:
In the above examples, a non-critical path exists before the pin
mp_vu_dp/mp_vu_vfmt/perm_cc/I0_Ctg/CLK and 500 ps can be borrowed
from it. Another non-critical path exists after the pin
mp_vu_dp/mp_vu_vfa/vfax324/W0_ctg/CLK and 460 ps can be borrowed
from it. If time is borrowed from the preceding path, the useful
skew is negative; otherwise it is positive. The useful skews of
the two pins are therefore -500 ps and 460 ps.
After building the clock sub-trees, we report the clock latency
from these pins.
From mp_vu_dp/mp_vu_vfmt/perm_cc/I0_Ctg/CLK, the latency is
336.72.
From mp_vu_dp/mp_vu_vfa/vfax324/W0_ctg/CLK, the latency is
353.46.
Before building the whole clock tree, we set these pins as float
pins:
The value of float_pin_max_delay_rise is equal to the latency of
the pin minus the useful skew of the pin.
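The rule above can be applied directly to the two pins discussed in this section. The sketch below is a worked example of that arithmetic only; it assumes the reported latencies are in picoseconds (the text gives them without a unit), and the function name is hypothetical.

```python
def float_pin_max_delay(latency_ps, useful_skew_ps):
    """Float pin delay per the rule in the text: latency minus useful skew."""
    return latency_ps - useful_skew_ps


# (sub-tree latency, useful skew) for the two pins named in the text.
# Assumption: latencies are in ps, matching the unit of the skew values.
PINS = {
    "mp_vu_dp/mp_vu_vfmt/perm_cc/I0_Ctg/CLK": (336.72, -500.0),
    "mp_vu_dp/mp_vu_vfa/vfax324/W0_ctg/CLK": (353.46, 460.0),
}

if __name__ == "__main__":
    for pin, (latency, skew) in PINS.items():
        delay = round(float_pin_max_delay(latency, skew), 2)
        print(f"{pin}: {delay} ps")
```

Following the stated rule and sign convention, the first pin comes out at 836.72 and the second at -106.54.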
After the above settings are complete, we build the whole clock
tree:
There are three clocks in our design, but at the top level these
three clocks are the same, so we use the command
set_inter_clock_delay_options -balance_group clk gvrclk pclk
to balance them.
The following is the CTS summary:

Figure 15 CTS Summary Report


After building the whole clock tree, we optimize timing and
reorder the scan chains:
We run the optimization twice to get better quality of results.
Clock signal routing:
4.4 Routing
The main routing command is route_opt, which includes global
routing, track assignment, detail routing and optimization.
Global route options:
First, we do an initial routing only:
After this, we extract RC information based on the initial routing
and perform the actual routing and optimization:
We choose the -area_recovery option to solve congestion issues.
If routing DRC violations or timing violations still exist after
the above step, you can use the following command for further
optimization:
The frequency of our design is 330 MHz. After the above steps, the
final timing violation is 50ps, and the results in PT are
acceptable.
The following commands insert the filler cells before writing out
the GDS data:
5.0 Timing Analysis
We use the Verilog netlist and SDC file generated by ICC to run
STA in PT-SI. With ICC in Arnoldi mode, PT-SI and ICC correlate
well, as shown below:

Figure 16 Timing Analysis in PT-SI (slack=-37.33ps)


PrimeTime SI (signal integrity) is an optional tool that adds
crosstalk analysis capabilities to the PrimeTime static timing
analyzer. We use PrimeTime SI to calculate the timing effects of
cross-coupling capacitance between nets and to include the
resulting delay changes in the PrimeTime analysis reports. It also
calculates the logic effects of crosstalk noise and reports
conditions that could lead to functional failure. The main PT-SI
settings used for report_timing are as follows:

Figure 17 Timing Analysis in ICC (slack=-121.92ps)
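To put the two slack figures in perspective, the sketch below converts the 330 MHz clock frequency to a period and expresses the difference between the PT-SI slack (Figure 16) and the ICC slack (Figure 17) as a fraction of that period. It is a worked calculation using only numbers quoted in this paper; the function names are hypothetical.

```python
CLOCK_MHZ = 330.0          # design frequency stated in the text
SLACK_PT_SI_PS = -37.33    # worst slack reported by PT-SI (Figure 16)
SLACK_ICC_PS = -121.92     # worst slack reported by ICC (Figure 17)


def clock_period_ps(freq_mhz):
    """Clock period in picoseconds (a 1 MHz clock has a 1e6 ps period)."""
    return 1.0e6 / freq_mhz


def correlation_gap_ps(slack_signoff_ps, slack_impl_ps):
    """How much more pessimistic the implementation tool is than sign-off."""
    return slack_signoff_ps - slack_impl_ps


if __name__ == "__main__":
    period = clock_period_ps(CLOCK_MHZ)
    gap = correlation_gap_ps(SLACK_PT_SI_PS, SLACK_ICC_PS)
    print(f"period = {period:.2f} ps")
    print(f"gap    = {gap:.2f} ps ({100.0 * gap / period:.2f}% of period)")
```

The period is about 3030 ps and the two tools differ by 84.59 ps, with ICC the more pessimistic of the two, so the gap is under 3% of the clock period.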


The main IC Compiler settings used for report_timing are as
follows:
6.0 Conclusions and Recommendations
From the VU physical design we conclude that IC Compiler
integrates custom layout design effectively and makes higher
density and better performance possible. Thanks to the Synopsys
synthesis and DFT tools, we obtained high flexibility and high
performance in optimizing congestion and timing in the IC Compiler
design flow. The consistency of commands across the Synopsys tools
also made the tools easy to learn and saved much time in physical
design. For better consistency of the VU physical design flow, IC
Compiler RP (Relative Placement) will be considered for custom
layout design in the next-generation C2 Microsystems project.
7.0 Acknowledgements

This paper was put together with help from Synopsys Application
Consultants and the backend team of C2 Microsystems. The authors
would like to thank the following people for their support and
kind advice on the VU physical design and on ICC tool issues:
Alfred Jiang, Rachel Xie and Wendy Gao from C2 Microsystems, and
ZhiZhong Wang, Jianjin Hu and Tao Wang from Synopsys AE.
8.0 References
[1] IC Compiler User Guide, Version Z-2007.03, March 2007, Synopsys
[2] DFT Compiler User Guide: Scan, Version Z-2007.06, June 2007, Synopsys
[3] Star-RCXT User Guide, Version Z-2007.06, June 2007, Synopsys
[4] PrimeTime User Guide, Version Z-2007.06, June 2007, Synopsys
[5] Media Processor Specification, C2 Microsystems