[Figure: arithmetic pipeline for Ai*Bi + Ci — R1 and R2 hold Ai and Bi and feed the multiplier (R3 ← R1 * R2) while R4 holds Ci; the adder then adds Ci to the product (R5 ← R3 + R4).]
1. Suppose a k-segment pipeline with a clock cycle time of tp is used to execute n tasks.
2. The first task T1 requires a time equal to k·tp to complete its operation, since it must pass through all k segments in the pipe.
3. The remaining n - 1 tasks emerge from the pipe at the rate of one task per clock cycle, and they will be completed after a time equal to (n - 1)·tp.
4. Therefore, to complete n tasks using a k-segment pipeline requires k + (n - 1) clock cycles.
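The cycle count above can be checked with a short sketch (Python; the values of n, k, and tp are illustrative):

```python
# Total time to run n tasks through a k-segment pipeline with clock cycle tp.
def pipeline_time(n, k, tp):
    """First task takes k*tp; each remaining task emerges every tp."""
    return (k + (n - 1)) * tp

# e.g. n = 100 tasks, k = 4 segments, tp = 20 ns
print(pipeline_time(100, 4, 20))  # (4 + 99) * 20 = 2060 ns
```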
SPEEDUP
For example:
There are 4 segments with delays t1 = 60 ns, t2 = 70 ns, t3 = 100 ns, and t4 = 80 ns, and the interface registers have a delay of tr = 10 ns.
Using the pipeline: the clock cycle must cover the slowest segment plus the register delay, so tp = max(ti) + tr = 100 + 10 = 110 ns.
Using a non-pipelined unit: a single task takes the sum of the segment delays, tn = 60 + 70 + 100 + 80 = 310 ns.
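The segment-delay example can be checked with a short sketch (Python; the task count n is illustrative):

```python
# 4-segment pipeline with delays t1..t4 and interface register delay tr (ns).
segment_delays = [60, 70, 100, 80]
tr = 10

tp = max(segment_delays) + tr   # pipeline clock cycle: slowest segment + register
tn = sum(segment_delays)        # non-pipelined time for one task

n = 1000                        # number of tasks (illustrative)
k = len(segment_delays)
pipelined = (k + (n - 1)) * tp
nonpipelined = n * tn
print(tp, tn, round(nonpipelined / pipelined, 2))
```

For a large number of tasks the ratio approaches tn/tp = 310/110 ≈ 2.82.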
If a pipeline executes many tasks, the overhead of start-up and drainage can be
ignored and the effective throughput of the pipeline taken to be 1 task per cycle. In terms
of instruction execution, this corresponds to an effective CPI (cycles per instruction) of 1.
Linear pipelines are applied for instruction execution, arithmetic computation, and
memory-access operations.
Depending on the control of data flow along the pipeline, we model linear pipelines in
two categories:
1. Asynchronous
2. Synchronous.
Asynchronous Model:
1. When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After stage Si+1 receives the incoming data, it returns an acknowledge signal to Si.
2. Asynchronous pipelines are useful in designing communication channels in message-passing multicomputers.
Synchronous Model
1. The latches are made with master-slave flip-flops, which can isolate inputs from outputs.
2. Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously.
The pipeline stages are combinational logic circuits. It is desired to have approximately
equal delays in all stages. These delays determine the clock period and thus the
speed of the pipeline.
Clock Cycle and Throughput
The clock cycle τ of a pipeline is determined as follows. Let τi be the time delay of
the circuitry in stage Si and d the time delay of a latch, as shown in Fig. 6.1b.
Denote the maximum stage delay as τm; then we can write τ as

τ = max{τi} + d = τm + d   (maximum taken over i = 1, …, k)
At the rising edge of the clock pulse, the data is latched into the master flip-flops of
each latch register. The clock pulse has a width equal to d. In general, τm >> d by
one to two orders of magnitude. This implies that the maximum stage delay τm
dominates the clock period.
The pipeline frequency is defined as the inverse of the clock period:

f = 1/τ
If one result is expected to come out of the pipeline per cycle, f represents the
maximum throughput of the pipeline.
The time required for a k-stage pipeline to complete n tasks is Tk = [k + (n - 1)]τ.
Speedup Factor The speedup factor of a k-stage pipeline over an equivalent nonpipelined
processor is defined as
Sk = T1/Tk = nkτ / [kτ + (n - 1)τ] = nk / [k + (n - 1)]
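A minimal sketch of this speedup factor (Python; n and k are illustrative):

```python
# Speedup Sk = nk / (k + n - 1) of a k-stage pipeline over a nonpipelined
# processor; it approaches k as the number of tasks n grows.
def speedup(n, k):
    return (n * k) / (k + n - 1)

print(round(speedup(100, 4), 2))   # close to k = 4 for large n
```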
The optimal choice of the number of pipeline stages should be able to maximize a
performance/cost ratio.
Let t be the total time required for a nonpipelined sequential program of a given
function. To execute the same program on a k-stage pipeline with an equal flow-
through delay t, one needs a clock period of
p = t/k + d, where d is the latch delay.
The total pipeline cost is roughly estimated by c + kh, where c covers the cost of
all logic stages and h represents the cost of each latch.
PCR = f / (c + kh) = 1 / [(t/k + d)(c + kh)]
Figure 2 plots the PCR as function of k. The peak of the PCR curve corresponds
to an optimal choice for the number of desired pipeline stages:
k0 = √(t·c / (d·h))
where t is the total flow-through delay of the pipeline. The total stage cost c, the
latch delay d, and the latch cost h can be adjusted to achieve the optimal value k0.
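A small sketch (Python) of the performance/cost trade-off; the delay and cost values t, d, c, h below are assumptions chosen so that k0 comes out to a round number:

```python
import math

# PCR = 1 / ((t/k + d)(c + k*h)) and its maximizer k0 = sqrt(t*c / (d*h)).
def pcr(k, t, d, c, h):
    return 1.0 / ((t / k + d) * (c + k * h))

def k_opt(t, d, c, h):
    return math.sqrt(t * c / (d * h))

t, d, c, h = 100.0, 2.0, 50.0, 4.0   # assumed flow-through delay, latch delay, costs
k0 = k_opt(t, d, c, h)
# the best integer stage count should sit at (or next to) k0
best_k = max(range(1, 51), key=lambda k: pcr(k, t, d, c, h))
print(round(k0, 2), best_k)
```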
Nonlinear Pipeline Processor
The traditional linear pipelines are static pipelines because they are used to
perform fixed functions.
Pipeline conflicts:
There are 3 major difficulties that cause the instruction pipeline to deviate from its
normal operation.
1. Resource conflict
2. Data dependency
3. Branch difficulties
Resource conflict:
Data dependency:
When an instruction depends on the result of a previous instruction, but this result
is not available.
Branch difficulties:
Branch difficulties arise from branch and other instructions that change the value
of the PC (program counter).
Example:
Instr 1:          FI DA FO EX
Instr 2:             FI DA FO EX
Branch Instr 3:         FI DA FO EX
Instr 4:                   -  -  FI DA FO EX
Instr 5:                      -  -  -  FI DA FO EX
Hazard:
1. In a pipelined system, instructions Ik+1 and Ik+2 may be started, and even
completed, before instruction Ik is completed.
2. This difference can cause problems if not properly considered in the design of
the control.
3. Existence of such dependencies causes what is called “hazard”.
4. The hardware technique that detects and resolves hazards is called interlock.
Types of hazards:
1. Instruction hazard
2. Data hazard
3. Branching hazards
4. Structural hazards
1. INSTRUCTION HAZARDS:
1. To handle this hazard, a centralized controller is required to keep the addresses in the
range sets for the instructions inside the pipeline.
2. On every instruction fetch, the PC must be compared against the addresses in the
address range sets used by the subsequent stages. A match with any of these
addresses means there is an instruction RAW hazard.
Types of hazards:
Instruction no.   Code
1.                STORE R4,A
2.                SUB R3,A
3.                STORE R3,A
4.                ADD R3,A
5.                STORE R3,A
RAW hazard:
In the above code, the second instruction must use the value of A updated by the first
instruction. If the second instruction (SUB) reads the value A before instruction 1 has
a chance to update it, a wrong value of data will be used by the CPU. This situation is
called a RAW (read-after-write) hazard.
WAR hazard:
A WAR (write-after-read) hazard between two instructions i and j occurs when
instruction j attempts to write to some object that is being read by instruction i.
The WAR hazard exists between instructions 2 and 3, since an attempt by
instruction 3 to record a value in A before instruction 2 has read that value is clearly
wrong.
WAW hazard:
A WAW (write-after-write) hazard between two instructions i and j occurs when
instruction j attempts to write to some object that is also to be modified by
instruction i.
Similarly, a WAW hazard occurs between instructions 3 and 5, since an attempt by
instruction 5 to store before the store of instruction 3 is clearly incorrect.
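The three hazard types above can be sketched as set intersections between the read (domain) and write (range) sets of an earlier instruction i and a later instruction j; the register/variable names below come from the example code:

```python
# Classify RAW / WAR / WAW hazards from the read and write sets of two
# instructions, where i precedes j in program order.
def hazards(reads_i, writes_i, reads_j, writes_j):
    found = set()
    if writes_i & reads_j:
        found.add("RAW")   # j reads what i writes
    if reads_i & writes_j:
        found.add("WAR")   # j writes what i reads
    if writes_i & writes_j:
        found.add("WAW")   # j writes what i also writes
    return found

# Instruction 1: STORE R4,A (reads R4, writes A); instruction 2: SUB R3,A (reads A, writes R3)
print(hazards(reads_i={"R4"}, writes_i={"A"},
              reads_j={"A"}, writes_j={"R3"}))   # {'RAW'}
```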
2. DATA HAZARDS:
Data hazards occur when data is modified. Ignoring potential data hazards can result
in race conditions. There are three situations in which a data hazard can occur:
Example:
Instr 1: R3 ← R1 + R2
Instr 2: R5 ← R3 + R2
The first instruction computes a value to be saved in register R3, and the second is going
to use this value to compute a result for register R5.
Example:
R1←R2+R3
R3←R4+R5
We must ensure that instruction 2 does not store its result into register R3 before
instruction 1 has had a chance to fetch its operands.
We must therefore delay the WB (write-back) of instruction 2 until instruction 1 has executed.
Eliminating hazards:
sw – store word
lw –load word
Data Forwarding:
1. Although the result of the second instruction is not yet stored in register $8 by the
time the third instruction needs it, the result is in fact available at the output of the
ALU. Thus, the needed value can be passed (over a bypass path) from the second
instruction to the third one.
For example:
Let’s say we want to write the value 3 to register 1, and then add 7 to register 1 and store the
result in register 2. i.e.
Instr-0: register 1 = 6
Instr-1: register 1 = 3
Instr-2: register 2 = register 1 + 7 = 10
2. However, if instr-1 does not completely exit the pipeline before instr-2 starts
execution, register 1 does not yet contain the value 3 when instr-2
performs its addition.
3. In such an event, instr-2 adds 7 to the old value of register 1, so register 2 would
contain 13.
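A toy sketch (Python) of the stale-value problem and the forwarding fix, using the register values from the example above:

```python
# Without a bypass, instr-2 reads the stale value of register 1;
# with forwarding it picks up the ALU result of instr-1 directly.
regs = {1: 6, 2: 0}          # instr-0 already wrote 6 into register 1

alu_result = 3               # instr-1 computed 3 but has not written back yet
stale = regs[1] + 7          # no forwarding: uses the old value
forwarded = alu_result + 7   # forwarding path: uses the ALU output
print(stale, forwarded)
```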
Operand hazard:
1. Logic to detect hazards in operand fetches can be worked out as follows from
the definition of the range and domain sets.
2. We can devise a mechanism to keep track of the domain and range sets for the
various instructions passing through the pipeline stages.
3. Each stage will be associated with one set of storage registers for the domain set
addresses and another for the range set addresses.
4. This storage will be required for all the stages beyond the decode stage.
5. When an instruction moves from stage to stage, it also carries its range and domain
set information.
Scalar processor:
Scalar processors represent the simplest class of computer microprocessors. A scalar
processor processes one data item at a time.
Vector processor:
[Figure 28a: Superscalar Pipeline Design — successive instructions overlap their IF, ID, EX, MEM, and WB stages, with multiple instructions in flight at once.]
In this design (Fig. 28a), the processor can issue two instructions per cycle if there
is no resource conflict and no data dependence problem.
4. For simplicity, we assume that each pipeline stage requires one cycle, except the
execute stage which may require a variable number of cycles. Four functional
units, multiplier, adder, logic unit, and load unit, are available for use in the
execute stage.
5. The multiplier itself has three pipeline stages, the adder has two stages, and the
others each have only one stage.
lookahead window
There is a lookahead window with its own fetch and decoding logic. This
window is used for instruction lookahead in case out-of-order instruction issue is
desired to achieve better pipeline throughput.
Limitations
Consider the example program in Fig. 28b. A dependence graph is drawn to indicate the
relationship among the instructions.
1. Because the register content in R1 is loaded by I1 and then used by I2, we have a
flow dependence: I1 → I2.
2. Because the result in register R4 after executing I4 may affect the operand register
R4 used by I3, we have an antidependence: I3 → I4.
3. Since both I5 and I6 modify register R6, and R6 supplies an operand for I6, we
have both a flow dependence I5 → I6 and an output dependence I5 → I6, as shown
in the dependence graph.
Pipeline Stalling :
In Fig. 6.29b, we show the effect of branching (instruction I2). A delay slot of
four cycles results from a branch taken by I2 at cycle 5. Therefore, both pipelines must be
flushed before the target instructions I3 and I4 can enter the pipelines from cycle 6.
In-Order Issue
Figure 30a shows a schedule for the six instructions being issued in program order I1, I2,
…, I6. Pipeline 1 receives I1, I3, and I5, and pipeline 2 receives instructions I2, I4, and I6
in three consecutive cycles. Due to the flow dependence I1 → I2, I2 has to wait one cycle
to use the data loaded in by I1.
I3 is delayed one cycle for the same adder used by I2. I6 has to wait for the
result of I5 before it can enter the multiplier stages. In order to maintain in-order
completion, I5 is forced to wait for two cycles to come out of pipeline 1. In total, nine
cycles are needed and five idle cycles (shaded boxes) are observed.
Only three idle cycles were observed. Note that in Figs. 29a and 29b, we did not
use the lookahead window. In order to shorten the total execution time, the window can
be used to reorder the instruction issues.
Out-of-Order Issue
In-order issue with in-order completion is the simplest scheme to implement. It is rarely
used today even in a conventional scalar processor due to the unnecessary delays involved in
maintaining program order.
1. OL = 1 cycle
The operation latency: the number of cycles until the result of an instruction is
available. A base scalar processor has a one-cycle latency for a simple operation.
2. IL = 1 cycle
The issue latency: the number of cycles required between issuing two consecutive
instructions. A base scalar processor can be defined as a machine with one
instruction issued per cycle.
3. MILP = 1 · k = k
The maximum instruction-level parallelism: the maximum number of simultaneously
executing instructions.
The instruction pipeline can be fully utilized if successive instructions enter it
continuously at the rate of one per cycle, as shown in Fig. 1.
Time required to execute N = 10 instructions on a k = 4 stage pipeline
= k + (N - 1) = 4 + (10 - 1)
= 13 base cycles
Superscalar Performance:
To compare the relative performance of a superscalar processor with that of a scalar base
machine, we estimate the ideal execution time of N independent instructions through the
pipeline.
T(m,1) = k + (N - m)/m   (base cycles)
where k is the time required to execute the first m instructions through the m
pipelines simultaneously, and the second term corresponds to the time required to execute
the remaining N - m instructions, m per cycle, through m pipelines.
The ideal speedup of the superscalar machine over the base machine is
S(m,1) = T(1,1)/T(m,1) = (N + k - 1)/(N/m + k - 1) = m(N + k - 1)/(N + m(k - 1))
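A small sketch (Python) of T(m,1) and S(m,1); the values of N, m, and k are illustrative:

```python
# Ideal execution time and speedup of an m-issue superscalar pipeline
# (k stages, N independent instructions) over the base scalar machine.
def t_superscalar(N, m, k):
    return k + (N - m) / m

def speedup_superscalar(N, m, k):
    t_base = k + N - 1          # T(1,1)
    return t_base / t_superscalar(N, m, k)

N, m, k = 120, 3, 4
print(round(speedup_superscalar(N, m, k), 2))   # approaches m for large N
```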
Superpipelined Design
T(1,n) = k + (N - 1)/n   (base cycles)

S(1,n) = T(1,1)/T(1,n) = (k + N - 1)/(k + (N - 1)/n) = n(k + N - 1)/(nk + N - 1)
• OL = n minor cycles
The single operation latency is n pipeline (minor) cycles, equivalent to 1 baseline
cycle; the pipeline cycle time is 1/n of a baseline cycle.
• IL = 1 minor cycle = 1/n baseline cycle
T(m,n) = k + (N - m)/(mn)   (base cycles)

S(m,n) = T(1,1)/T(m,n) = (k + N - 1)/(k + (N - m)/(mn)) = mn(k + N - 1)/(mnk + N - m)
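The combined superscalar-superpipelined model can be sketched the same way (Python; N and k are illustrative):

```python
# Combined superpipelined (degree n) superscalar (degree m) timing model
# T(m,n) = k + (N - m)/(m*n) and its speedup over the base machine T(1,1).
def T(m, n, N, k):
    return k + (N - m) / (m * n)

def S(m, n, N, k):
    return T(1, 1, N, k) / T(m, n, N, k)

N, k = 120, 4
# degree (1,1) is the base machine; higher degrees approach a speedup of m*n
print(round(S(2, 2, N, k), 2), round(S(3, 2, N, k), 2))
```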
clock 0 1 2 3 4 5 6
unit0 X . . . X X .
unit1 . X . . . . .
unit2 . . X . . . .
unit3 . . . X . . .
unit4 . . . . . . X
In reservation tables, 'X' means "the function unit is busy at that clock" and '.' means
"the function unit is not busy at that clock".
Once a task enters the pipeline, it is processed by unit0 at the first clock, by unit1
at the second clock, and so on. It takes seven clock cycles to perform a task.
Example of a "conflict"
clock 0 1 2 3 4 5 6 7
unit0 0 1 . . 0 C 1 .
unit1 . 0 1 . . . . .
unit2 . . 0 1 . . . .
unit3 . . . 0 1 . . .
unit4 . . . . . . 0 1
('0's and '1's in this table, except those in the first row, represent tasks 0 and 1,
respectively, and 'C' marks the conflict.)
Notice that no special hardware is provided to avoid simultaneous use of the same
function unit.
Therefore, a task must not be started if it would conflict with any tasks being processed.
For instance:
If two tasks, say task 0 and task 1, were started at clock 0 and clock 1, respectively, a
conflict would occur on unit0 at clock 5.
This means that you should not start two tasks with a single-cycle interval.
This invalid schedule is depicted in the following process table, which is obtained by
overlapping two copies of the reservation table with one being shifted to the right by 1
clock.
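The reservation table above can be analyzed mechanically: a start latency is forbidden whenever two tasks started that many clocks apart would claim the same unit at the same clock. A sketch (Python), encoding the table from the first example:

```python
# For each unit, the set of clocks at which it is marked 'X' in the
# reservation table. A latency L is forbidden if some unit is busy at two
# clocks that are exactly L apart.
table = {
    "unit0": {0, 4, 5},
    "unit1": {1},
    "unit2": {2},
    "unit3": {3},
    "unit4": {6},
}

forbidden = set()
for clocks in table.values():
    for a in clocks:
        for b in clocks:
            if a < b:
                forbidden.add(b - a)   # a task started b-a cycles later collides

print(sorted(forbidden))   # latency 1 is forbidden: the clash on unit0 at clock 5
```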
Reservation Analysis
1. from S1 to S2
2. from S2 to S3
Two reservation tables are given in Figs. 3b and 3c, corresponding to a function X
and a function Y, respectively. Each function evaluation is specified by one
reservation table.
For example:
The function X requires eight clock cycles to evaluate, and function Y requires
six cycles, as shown in Figs. 3b and 3c, respectively.
To execute an instruction in hardware, the content of the program counter (PC) is
supplied to the instruction cache and an instruction word is read out from the specified
location.
Once an instruction has been read out from the instruction cache, its various fields are
separated and each is dispatched to the appropriate place.
a) 00 for rt
b) 01 for rd
c) 10 for $31.
iii) Of course, not every instruction writes a value into a register. Writing into a
register requires that the RegWrite control signal be asserted by the control
unit; otherwise, regardless of the state of RegDst, nothing is written into the
register file.
iv) The instruction cache does not receive any control signal, because an
instruction is read out in every cycle (i.e., the clock signal serves as read
control for the instruction cache).
2. Lower input of the ALU
a) A multiplexer allows the control unit to choose either the content of rt or the
32-bit sign-extended version of the 16-bit immediate operand.
b) The first or top input always comes from rs.
c) This is controlled by asserting or deasserting the control signal
ALUSrc. If this signal is deasserted (has a value 0), the content of rt is
used as the lower input to the ALU; otherwise, the immediate operand,
sign-extended to 32 bits, is used.
3. Outputs of the ALU and data cache
a) A multiplexer allows the word supplied by the data cache, the output from the ALU,
or the incremented PC value to be sent to the register file for writing; this is effected
by a pair of control signals, RegInSrc.
b) RegInSrc is set to 00 for choosing the data cache output, 01 for the ALU output, and
10 for the incremented PC value coming from the next-address block.
In this case the data path is capable of executing one instruction per clock cycle; hence
the name "single-cycle data path."
The ALUFunc control signal bundle contains 5 bits:
1. 1 bit for adder control (Add/Sub)
2. 2 bits for controlling the logic unit (logic function)
3. 2 bits for controlling the rightmost multiplexer.
4. The ALU also produces an output signal indicating overflow in addition or subtraction.
5. In the case of arithmetic and logic instructions, the ALU result must be stored in
the destination register and is thus forwarded to the register file through the
feedback path.
6. In the case of memory access instructions, the ALU output is a data address, used
for writing into the data cache (DataWrite asserted) or reading from it (DataRead
asserted).
7. In the latter case, the data cache output, which appears after a short latency, is
sent through the lower feedback path to the register file for writing.
PIPELINED DATA PATHS
Strategy 1:
Use multiple independent data paths that can accept several instructions that are
read out at once; this leads to a multiple-instruction-issue or superscalar
organization.
Strategy 2:
Overlap the execution of several instructions in the single-cycle design, starting
the next instruction before the previous one has run to completion; this leads to a
pipelined or superpipelined organization.
In this section, we aim to determine the actual performance gain by modeling the effects
of overheads and dependencies.
[Figure 6: pipelined form of a function unit with latching overhead — each of the q stages contributes a logic delay of t/q plus a latching delay for its results.]
The overhead due to unequal stage latencies and latching of signals between
stages can be modeled by taking the stage delay to be t / q + r instead of the ideal t / q, as
in Figure 6.
Then, the increase in throughput will drop from the ideal of q to

Throughput increase in a q-stage pipeline = t / (t/q + r) = q / (1 + qr/t)
So, the ideal throughput enhancement factor of q can be approached, provided the
cumulative pipeline latching overhead qr is considerably smaller than t.
Figure 7 plots the throughput improvement as a function of q for different overhead ratios
r/t.
We note that the overhead ratio r/t (the relative magnitude of the per-stage time
overhead r) limits the maximum performance that can be attained.
Note, however, that we have not yet considered the effects of data and control
dependencies.
1. To keep our analysis simple, we assume that most data dependencies are
resolved through data forwarding, with a single bubble inserted in the pipeline
only for a read-after-load data dependency (Figure 4).
2. Similarly, a fraction of branch instructions (those for which the compiler does
not succeed in filling the branch delay slot with a useful instruction) lead to
insertion of a single bubble in the pipeline.
3. In any case, no type of dependency results in the insertion of more than one
bubble. Let the fraction of all instructions that are followed by a bubble in the
pipeline be β. Such bubbles effectively reduce the throughput by a factor of
1 + β, making the throughput equation
Throughput increase with dependencies = q / [(1 + qr/t)(1 + β)]
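A minimal sketch (Python) of this final throughput model, with assumed values for q, r/t, and β:

```python
# Throughput gain of a q-stage pipeline with per-stage latching overhead r
# (relative to the total logic delay t) and bubble fraction beta.
def throughput_gain(q, r_over_t, beta=0.0):
    return q / ((1 + q * r_over_t) * (1 + beta))

# the ideal gain of q = 8 drops once overhead and dependency bubbles appear
print(round(throughput_gain(8, 0.0), 2),
      round(throughput_gain(8, 0.05), 2),
      round(throughput_gain(8, 0.05, beta=0.2), 2))
```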