Pipeline PDF

Speeding up the Clock
Registered Datapath
The register-to-register delay is usually the delay

path that sets the maximum clock rate
From a design point of view, can only affect the
combinational logic between the registers
D
F
F
Need to shorten the maximum combinational delay path

Setup/Hold time of registers are fixed
DFFs are rising

edge triggered
D
F
F
LOGIC
tpd
Clk Freq = 1/ (Tclk2q + Tpd + Tsu)

Latency = 1 clk
Can shorten the delay by placing a register in the

combinational logic to break longest delay path
This technique is called pipelining
Adds latency to the output (the number of clocks
between an input value and its corresponding output
result)
BR 1/99
InA
LOGIC
Tpd/2
LOGIC
D
F
F
Tpd/2
InC
Rate at which new computations are initiated

Maximum initiation rate = 1
Latency
Number of clock cycles between input value and output
value
Adding pipeline stages always increases latency
At some point, adding more pipeline stages does

not increase clock frequency because Tclk2q and
Tsu dominate delay.
InD
OutB
3
BR 1/99
Pipeline Example
8-bit
add
32-bit ripple carry adder.
CO
Longest delay path in carry chain

from byte0 to byte3
32
CI
byte1
8-bit
add
D
F
F
CO
CI
8-bit
add
B
32
D
F
F
CO
32
32
CO
16
16
16
16
8-bit
add
byte0
CO
CI
8-bit
add
CO
Sum
16
byte1
D
F
F
Sum (low 16 bits)
A
Can I use pipelining to
speed this up?
D
F
F
16
16
byte3
16
Sum (low 16 bits)

16
CI
8-bit
add
D
F
F
D
F
F
16
(high 16 bits)
byte2
CI
8-bit
add
F
32
Insert pipeline stage between byte1 and byte2.

(low 16 bits)
byte0
32
Initiation Rate - Rate at which new input values are

accepted after first output value
D
F
F
OutA
D
F
F
OutC
Definitions
BR 1/99
OutB
BR 1/99
Latency = 2 clks
InB
InD
OutA
Clk Freq = 1/ (Tclk2q + Tpd/2 + Tsu)
InA
InC
Add a pipeline stage

D
F
F
InB
byte2
CO
16
16
16
CI
8-bit
add
D
F
F
Sum (high 16 bits)

16
byte3
CO
BR 1/99
BR 1/99
Comments on Pipeline Example
Latency Tolerance
Latency tolerance is dependent upon each
application
Frequent flushing of a pipeline (discarding partial
results within the pipeline and restarting the
pipeline with a new value) wastes time and makes
an application latency intolerant.
Flushing of a pipeline introduces clock cycles in
which the results coming out of the pipeline are
ignored -- these are wasted clock cycles.
High Latency tolerance means that you can have
many pipeline stages, whatever the number you
need to meet the clock rate specification.
Note that the pipeline stage broke the carry chain

into two equal paths
Each pipeline stage should have approximately the same
combinational delay
Clock speed will be set by the delay of the slowest
pipeline stage
If I inserted 2 pipeline stages, I would need to break

the carry chain delay into equal thirds
Could insert a pipeline stage between each BIT in
order to get maximum clock speed
Called bit pipelining
BR 1/99
BR 1/99
Two Applications
BLEND Datapath without Pipelining. MULT is

combinational.
MSB
of F
A*F
Graphics hardware for processing pixels is

extremely latency tolerant - not unusual to find
pipelines that have 10s of stages.
Graphics pipelines are never flushed
High clock rate is EXTREMELY important because of
large number of pixels (> 1 Million) that have to be
supplied every screen, at > 30 updates per second
Microprocessor instruction pipelines are not very

latency tolerant - most CPU pipelines are only
about 5-10 stages.
Branch instructions can cause pipeline to be flushed. By

the time you determine direction of branch, may have
started processing instructions that should not be in the
pipeline. These are flushed and the pipeline restarted.
BR 1/99
SAT
ADDER
R
E
G
R
E
G
D
MULT
F
2/1
Mux
1-F
B* (1-F)
BR 1/99
R
E
G
MULT
R
E
G
R
E
G
MSB
of 1-F
2/1
Mux
SAT
ADDER
MULT
1-F
B* (1-F)
LPM_Mult can be pipelined via LPM_PIPELINE parameter.

Add one pipeline stage to LPM Mult. DFFs inserted
automatically in LPM_MULT. We now have a LATENCY
mismatch within our datapath!!!!
MSB
A*F
of F
R
A
2/1
E
D
Mux
MULT
G
F
R
E
G
2/1
Mux
MSB
of 1-F
BR 1/99
10
Correct the latency mismatch by adding DFFs in other path as well. May
have to break delay paths in other places or add additional pipeline stages
to LPM_mult to meet clock frequency target. Control lines (like MUX
control line), must be pipelined also.
A*F
A
R
E
G
11
D
F
F
D
MULT
F
2/1
Mux
MSB
of F
SAT
ADDER
1 clk latency
0 clk latency
R
E
G
D
F
F
R
E
G
R
E
G
D
MULT
F
2/1
Mux
1-F
B* (1-F)
BR 1/99
MSB
of 1-F
R
E
G
1 clk latency
0 clk latency
12

Pipeline PDF

Hochgeladen von

Dokumentinformationen

Originalbeschreibung:

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Pipeline PDF

Hochgeladen von

Copyright:

Verfügbare Formate

Speeding up the Clock

The register-to-register delay is usually the delay

Need to shorten the maximum combinational delay path

DFFs are rising

Clk Freq = 1/ (Tclk2q + Tpd + Tsu)

Can shorten the delay by placing a register in the

Rate at which new computations are initiated

At some point, adding more pipeline stages does

32-bit ripple carry adder.

Longest delay path in carry chain

Sum (low 16 bits)

Sum (low 16 bits)

Insert pipeline stage between byte1 and byte2.

Initiation Rate - Rate at which new input values are

Clk Freq = 1/ (Tclk2q + Tpd/2 + Tsu)

Add a pipeline stage

Sum (high 16 bits)

Comments on Pipeline Example

Note that the pipeline stage broke the carry chain

If I inserted 2 pipeline stages, I would need to break

BLEND Datapath without Pipelining. MULT is

Graphics hardware for processing pixels is

Microprocessor instruction pipelines are not very

Branch instructions can cause pipeline to be flushed. By

LPM_Mult can be pipelined via LPM_PIPELINE parameter.

Das könnte Ihnen auch gefallen