Lec20 PDF

The University of Texas at Dallas
Erik Jonsson School of Engineering and Computer Science
The Pipelined MIPS Processor

We complete our study of computer architecture by investigating an approach providing even higher performance for the MIPS CPU. We first saw how the MIPS CPU performance could be improved by converting the so-called single-cycle CPU to a multi-cycle design.
In the multi-cycle approach, instead of using a single clock cycle for the whole instruction, the clock is accelerated, and instructions execute in phases over several clock cycles. Each instruction phase takes one clock cycle. This means that as each instruction executes, only one section of the CPU will be active per clock cycle -- the one executing that phase of the instruction.
This suggests that perhaps we might redesign the CPU slightly so that every CPU section can operate independently on an instruction at the same time.
Lecture #20: The Pipeline MIPS Processor
N. B. Dodge 09/12
The Laundry Example

As an introduction to the concept of pipelining, Patterson and Hennessy use the example of doing one's laundry. Most people have, or have access to, a washer and dryer. Assume that you need to wash several washer loads of clothing. Would anyone divide the clothing into washer loads and then wash, dry, fold, and put away the first load before starting the second? No; if you were washing clothes, you would finish washing the first load, put it in the dryer, and start the second load washing. If there were more loads to wash, you would begin to fold and put away finished clothing while the later loads were washing and drying. We can see this schematically on the next slide.
Graphical Example of the Laundry Cycle
[Figure: schematic of successive laundry loads overlapping in time.]
The Pipeline Processor

Patterson and Hennessy applied this simultaneous wash-dry-fold-put away concept to the single-cycle computer model. The idea was to "wash," "dry," "fold," and "put away" instructions simultaneously so that the number of clock cycles per instruction could be dramatically decreased, raising instruction throughput. In the case of the single-cycle model, one instruction is done per clock cycle, but the clock must be as slow as the slowest instruction. In the multi-cycle implementation, the clock runs faster and instructions take 3-5 cycles, but only one instruction is processed at a time. What if, each time the clock ticked, we could process an instruction in each section of the multicycle processor? Then we could process several instructions simultaneously, approaching the goal of completing an instruction every clock cycle.
Pipeline Architecture
A pipelined computer executes instructions concurrently. Hardware units are organized into stages:
Execution in each stage takes exactly 1 clock period. Stages are separated by pipeline registers that preserve and pass partial results to the next stage.
Unfortunately, as noted earlier, speed = complexity + cost. The pipeline approach brings additional expense plus its own set of problems and complications, called hazards, which we will also study.
Sequential Versus Pipelined Execution

[Figure: two timelines (clock cycles) for the sequence lw $t0, 16($a3); lw $t1, 32($a3); lw $t2, 48($a3). Each instruction passes through Instruc. Fetch, Reg. Fetch, ALU Process, Mem. R/W, and Reg. or ALU Out Write. In the sequential timeline, each instruction starts only after the previous one has finished. In the pipelined timeline, a new instruction enters Instruc. Fetch on every clock cycle, so the stages of successive instructions overlap.]
Speed Advantage of the Pipeline

The multicycle, serial processor that we studied last lecture can execute n instructions in ns clock periods, or

\[ ET_S = ns, \]

where ET is the execution time and s is the number of stages. A pipelined processor with s stages can execute n instructions in

\[ ET_P = s + (n - 1) \]

clock periods. The ideal pipeline speedup depends on the number of stages, and can be greater for more stages (hence Intel's choice of a 20-stage pipeline for the current P-IV). Thus the speed advantage of pipeline over multicycle can be defined as:

\[ S_P = \frac{ET_S}{ET_P} = \frac{ns}{s + (n - 1)} \to s \quad \text{as } n \to \infty \]
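As a quick check of the formulas above, the following Python sketch evaluates ET_S = ns, ET_P = s + (n - 1), and their ratio for a few instruction counts. The function names are illustrative and not from the lecture; like the formulas, the sketch assumes every instruction uses all s stages and that no hazards occur.

```python
def serial_cycles(n: int, s: int) -> int:
    """Clock periods for n instructions on an s-stage serial (multicycle) CPU: ET_S = n*s."""
    return n * s


def pipelined_cycles(n: int, s: int) -> int:
    """Clock periods for n instructions on an s-stage pipeline: ET_P = s + (n - 1)."""
    return s + (n - 1)


def speedup(n: int, s: int) -> float:
    """Ideal speedup S_P = ET_S / ET_P (no hazards, no stalls)."""
    return serial_cycles(n, s) / pipelined_cycles(n, s)


if __name__ == "__main__":
    # The speedup climbs toward s = 5 as the instruction count grows.
    for n in (1, 3, 100, 1_000_000):
        print(f"n = {n:>9}: speedup = {speedup(n, 5):.3f}")
```

For a single instruction the pipeline gives no advantage at all; the benefit only appears once enough instructions are in flight to keep all stages busy.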
Pipeline Stages
[Figure: the five pipeline stages IF, ID/RF, ALU, MEM, and WB laid out against clock cycles 0-5.]
The MIPS R2000 pipeline processor is divided into five processing stages:
1. Instruction fetch (IF)
2. Instruction decode (ID) and register fetch (RF)
3. ALU instruction execution (ALU): ALU processing, branch condition evaluation, memory address computation, etc. This is also referred to as execution (EX).
4. Memory access (MEM)
5. Write back (WB) to register file
Overlapped Pipeline Execution

Clock cycles     0-1     1-2     2-3     3-4     4-5     5-6     6-7
Instruction 1    IF      ID/RF   ALU     MEM     WB
Instruction 2            IF      ID/RF   ALU     MEM     WB
Instruction 3                    IF      ID/RF   ALU     MEM     WB

(Rows are listed in instruction execution order.)
Single-Cycle Datapath
[Figure: single-cycle MIPS datapath, showing the PC, instruction memory, register block, sign extend, ALU and ALU control, data memory, the +4 and branch-address adders, the multiplexers, and the control unit with its Reg. Dest., Branch, Mem. Read, Mem. To Reg., ALU Op., Mem. Write, ALU Srce., and Reg. Write lines.]
Lines indicate need for storage between stages if processor is converted to pipeline.
Single-Cycle Datapath with Pipeline Registers

[Figure: single-cycle datapath with the inter-stage pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted, each drawn with a master side and a slave side. After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition.]
Inter-stage registers are master-slave D flip-flops; the master can be receiving new data from the previous stage of the instruction while the slave flip-flop is providing data to the next stage.
Note: Control lines and logic not shown for clarity.
Instruction Process Through Pipeline

[Figures: five slides of the pipelined datapath, one per stage, following a single instruction through the IF/ID, ID/EX, EX/MEM, and MEM/WB registers.]
Stage 1: Instruction loaded into IF/ID register; PC ← PC+4.
Stage 2: Instruction decoded, register data accessed, immediates sign-extended.
Stage 3: Instruction executed / branch address computed.
Stage 4: Memory load or store; branch taken/not taken; ALU results bypass memory and are taken to the MEM/WB register.
Stage 5: Result write-back to destination register.
Adding Control
Control information must be carried along as a part of the instruction, since this information is required at different stages of the pipeline. This can be done by adding more inter-stage storage register bits to forward control data yet to be used. The result is very large inter-stage registers. For example, the storage capacity required between the instruction decode and ALU execution stages (ID/EX register) is more than 120 bits. The resulting processor with full control functionality is shown on the next slide.

Full Pipeline Design with Control Lines
[Figure: pipelined datapath with the Control Decode block and the control lines (Register Write, Branch, ALU Srce, ALU Op, Reg. Dst., Memory Write, Memory Read, Memory/ALU Result) carried through the ID/EX, EX/MEM, and MEM/WB registers.]
The Pipeline in Action

The following instruction sequence from the P&H text illustrates the pipeline in action.
lw $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7
add $14, $8, $9
Note that registers are identified by number rather than the letter ids, since that is the way they appear in the MIPS processor. As a reminder, $1=$at, $8-14=$t0-t6, $2-3=$v0-v1, $4-7=$a0-a3, etc.
[Figures: eleven slides of the full pipeline diagram, one per clock cycle, showing the instruction in each stage and the register ids and data moving through the inter-stage registers. The stage occupancy on each slide is:]

Step    IF                ID/RF             EX                MEM               WB
1       Idle              Idle              Idle              Idle              Idle
2       lw $10, 20($1)    Idle              Idle              Idle              Idle
3       sub $11, $2, $3   lw $10, 20($1)    Idle              Idle              Idle
4       and $12, $4, $5   sub $11, $2, $3   lw $10, 20($1)    Idle              Idle
5       or $13, $6, $7    and $12, $4, $5   sub $11, $2, $3   lw $10, 20($1)    Idle
6       add $14, $8, $9   or $13, $6, $7    and $12, $4, $5   sub $11, $2, $3   lw $10, 20($1)
7       Idle              add $14, $8, $9   or $13, $6, $7    and $12, $4, $5   sub $11, $2, $3
8       Idle              Idle (WB)         add $14, $8, $9   or $13, $6, $7    and $12, $4, $5
9       Idle              Idle (WB)         Idle              add $14, $8, $9   or $13, $6, $7
10      Idle              Idle (WB)         Idle              Idle              add $14, $8, $9
11      Idle              Idle              Idle              Idle              Idle
Pipeline Processor Operation Summary

Pipelining replaces the single-cycle processor with a row of five "mini-processors," each capable of completing one part of each instruction. A new instruction is started every clock cycle. Inter-stage registers store instruction information (data, write register, branch conditions) between cycles so that the entire instruction "envelope" is passed between the pipeline stages. When the pipeline is filled with instructions, an instruction completes every clock cycle.
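The fill-and-drain behavior summarized above can be sketched in a few lines of Python. This is an idealized model that assumes no hazards or stalls; the function name `occupancy` and the stage labels are chosen here for illustration. It reuses the five-instruction P&H sequence from the earlier slides.

```python
STAGES = ["IF", "ID/RF", "EX", "MEM", "WB"]


def occupancy(program):
    """Return one dict per clock cycle mapping stage name -> instruction (or 'Idle').

    Instruction i enters IF in cycle i and advances one stage per clock,
    so in a given cycle the stage at position k holds instruction (cycle - k).
    """
    n, s = len(program), len(STAGES)
    rows = []
    for cycle in range(s + n - 1):          # s + (n - 1) cycles in total
        row = {}
        for k, stage in enumerate(STAGES):
            i = cycle - k
            row[stage] = program[i] if 0 <= i < n else "Idle"
        rows.append(row)
    return rows


PROGRAM = [
    "lw $10, 20($1)",
    "sub $11, $2, $3",
    "and $12, $4, $5",
    "or $13, $6, $7",
    "add $14, $8, $9",
]

if __name__ == "__main__":
    for cycle, row in enumerate(occupancy(PROGRAM), start=1):
        print(cycle, " | ".join(f"{stage}: {row[stage]}" for stage in STAGES))
```

With five instructions and five stages, the pipeline is completely full in exactly one cycle (the fifth), and one instruction leaves WB on every cycle from then on.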
Exercise 1
On the diagram on the next page, identify the following:
1. Highlight all the control lines that must be active during a load word instruction.
2. As in our exercise in Lecture 20, identify the decoder locations.
3. The ID/EX register interface stores the most bits of any of the pipeline section interfaces. Approximately how many bits is that, according to the diagram?
Print out a copy of this diagram and bring to class.

[Figure: full pipeline design with control lines, the diagram to be marked up for Exercise 1.]
Hazards
Hazards occur because data required for executing the current instruction may not be available. An instruction in the register fetch cycle may need data from a register whose value will be changed by an instruction downstream but still in process in the pipeline (in the ALU, memory/memory bypass, or writeback cycle). Thus an upstream instruction could access a register and get incorrect data because the register data has not yet been updated by a downstream instruction.
Hazards (2)
There are two types of hazards: data hazards and control hazards. Both occur because an instruction in the ID/RF stage of the MIPS pipeline needs register data that will shortly be updated by instructions in the EX, MEM/Bypass, or WB stage. Data hazards occur when an instruction needs register contents for an arithmetic, logical, or memory instruction. Control hazards occur when a branch instruction is pending and the data necessary to initiate or bypass the branch is not yet available, in the same sort of scenario.
Data Hazard in the Pipeline

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

[Figure: pipelined timeline (clock cycles) for the five instructions above. Each passes through Instruc. Fetch, Reg. Fetch, ALU Process, Mem. R/W, and Reg. or ALU Out Write, starting one clock cycle after the previous instruction.]
In the instruction sequence above, the last four instructions require data from $2, which is changed in the first instruction. The $2 data will not be rewritten until cycle 4, so the AND and OR (2nd and 3rd instructions) will fetch incorrect data from $2. Even the add may not get the correct information (sw is okay).
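The reasoning above can be checked mechanically. The sketch below assumes a five-stage pipeline with no forwarding, in which a register is written in WB and read in ID/RF, so an instruction is at risk if it follows the producer by one to three slots. The tuple encoding of instructions and the function name `raw_hazards` are assumptions made for this example, not part of the lecture.

```python
# Each instruction: (text, destination register or None, source registers).
PROGRAM = [
    ("sub $2, $1, $3",  2,    (1, 3)),
    ("and $12, $2, $5", 12,   (2, 5)),
    ("or $13, $6, $2",  13,   (6, 2)),
    ("add $14, $2, $2", 14,   (2, 2)),
    ("sw $15, 100($2)", None, (15, 2)),
]


def raw_hazards(program):
    """List (producer, consumer, kind) for read-after-write hazards without forwarding.

    A consumer 1 or 2 slots behind the producer reads the register before WB
    ("stale read"). A consumer 3 slots behind reads it in the same cycle as the
    write, so the result depends on how the register file is built.
    """
    hazards = []
    for i, (producer, dest, _) in enumerate(program):
        if dest is None or dest == 0:       # sw writes no register; $0 is constant
            continue
        for j in range(i + 1, min(i + 4, len(program))):
            consumer, consumer_dest, sources = program[j]
            if dest in sources:
                kind = "stale read" if j - i <= 2 else "read in the write-back cycle"
                hazards.append((producer, consumer, kind))
            if consumer_dest == dest:       # a later write supersedes this producer
                break
    return hazards


if __name__ == "__main__":
    for producer, consumer, kind in raw_hazards(PROGRAM):
        print(f"{consumer:<18} depends on {producer:<16} -> {kind}")
```

The sketch reports the and and or as stale reads and the add as a read in the write-back cycle, and it reports nothing for the sw, which matches the slide.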
Control Hazards in the Pipeline

sub $2, $1, $3
blt $2, $8, wait
bgt $2, $7, go
add $14, $2, $2
sw $15, 100($2)

[Figure: pipelined timeline (clock cycles) for the five instructions above. Each passes through Instruc. Fetch, Reg. Fetch, ALU Process, Mem. R/W, and Reg. or ALU Out Write, starting one clock cycle after the previous instruction.]
Here the problem is changed, with two branch instructions added. Neither branch instruction may be executed correctly, once again because the new $2 data will not be ready. This wrong data could cause an incorrect branch.
Forwarding as a Solution to Data Hazards

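[Figure: two instructions in the pipeline against clock cycles 0-5, each passing through IF, ID/RF, ALU, MEM, and WB, with instruction 2 starting one clock cycle after instruction 1.]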
One solution to the problem of data hazards is forwarding. Forwarding uses the fact that although instruction 2 needs register data two clock cycles before instruction 1 enters the WB stage, that data is already available as the output of the ALU. If a mechanism were available, instruction 1 could forward required register data after its ALU cycle to the ID/RF cycle of instruction 2.
Forwarding Unit in the Pipeline

[Figure: the EX, MEM, and WB portion of the pipeline with a Forwarding Unit added. The unit receives Rs and Rt from the ID/EX register and the Rd ids from the EX/MEM and MEM/WB registers (EX/MEM Register Rd, MEM/WB Register Rd), and drives the Forward A and Forward B multiplexers at the ALU inputs.]
Forwarding Unit Operation

The forwarding unit samples register ids in the EX/MEM and MEM/WB registers to determine if source registers in the ID/RF cycle are the same. If so, source register data is replaced by pipeline (as yet unwritten) data by the forwarding unit. The correct information is thus processed and the instruction can proceed to correct execution.
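The comparison described above can be written out as a small selection function. This is a sketch of the usual textbook conditions rather than the exact logic on the slide: the function name, argument names, and returned labels are assumptions made for illustration. It is evaluated once per ALU operand, which corresponds to the Forward A and Forward B multiplexers.

```python
def forward_select(src, exmem_rd, exmem_regwrite, memwb_rd, memwb_regwrite):
    """Choose where one ALU operand should come from.

    src            -- source register number (Rs or Rt) of the instruction entering the ALU
    exmem_rd       -- destination register id held in the EX/MEM register
    exmem_regwrite -- True if the instruction in EX/MEM will write a register
    memwb_rd       -- destination register id held in the MEM/WB register
    memwb_regwrite -- True if the instruction in MEM/WB will write a register
    """
    # Register $0 is hard-wired to zero, so it is never forwarded.
    if exmem_regwrite and exmem_rd != 0 and exmem_rd == src:
        return "EX/MEM"      # most recent result: ALU output of the previous instruction
    if memwb_regwrite and memwb_rd != 0 and memwb_rd == src:
        return "MEM/WB"      # result from two instructions back (ALU or memory data)
    return "REG"             # no match: use the value read from the register block


if __name__ == "__main__":
    # sub $2, $1, $3 is in MEM (EX/MEM.Rd = 2) while and $12, $2, $5 enters the ALU.
    print(forward_select(src=2, exmem_rd=2, exmem_regwrite=True,
                         memwb_rd=0, memwb_regwrite=False))
```

Checking EX/MEM before MEM/WB matters when both hold the same destination register: the EX/MEM copy is the newer value.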
Stalls
Forwarding will not always solve the problems of data hazards. For example, suppose an add instruction follows a load word (lw), and the add involves the register that receives the memory data. In this case, forwarding alone will not work. The reason is that the data must be read from memory, and so it will not be available until the end of the MEM cycle. Thus the required data is not available for a forward, and the add instruction, if it proceeds, will process the wrong data. A solution to this problem is the stall. A stall halts the instruction awaiting data, while the key instruction (a lw in this case) proceeds to the end of the MEM cycle, after which the desired data is available to the add.
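The condition that triggers a stall can be expressed as a single test. The sketch below uses the formulation commonly given in the P&H text for the load-use case; the function and argument names are assumptions for illustration and do not come from the slides.

```python
def needs_stall(idex_mem_read, idex_rt, ifid_rs, ifid_rt):
    """Return True if the instruction in ID/RF must wait one cycle.

    idex_mem_read -- True if the instruction now in EX is a load (lw)
    idex_rt       -- register the load will write (the Rt field of lw)
    ifid_rs       -- first source register of the instruction in ID/RF
    ifid_rt       -- second source register of the instruction in ID/RF
    """
    return idex_mem_read and idex_rt in (ifid_rs, ifid_rt)


if __name__ == "__main__":
    # lw $2, 32($3) is in EX while add $14, $6, $2 is in ID/RF: the add must wait.
    print(needs_stall(True, 2, 6, 2))
    # sub $11, $2, $3 in EX is not a load, so forwarding alone is enough.
    print(needs_stall(False, 11, 6, 2))
```

Only loads trigger the stall, because an ALU result is already available for forwarding at the end of EX, while loaded data exists only after MEM.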

Result of Stall Approach

lw $2, 32($3)
add $14, $6, $2
sw $15, 80($2)

[Figure: pipelined timeline (clock cycles) for the three instructions above, each starting one clock cycle after the previous instruction.]
Consider the 3 instructions above, the last two depending on the lw. $2 contents will be available at the beginning of the WB stage in the first instruction, but not before. A solution is to let the lw proceed down the pipe, while the add and sw instructions hold place for one cycle.
Result of Stall Approach (2)

Timeline (clock cycles)
Instruction                          0-1   1-2   2-3   3-4   4-5   5-6   6-7   7-8
lw $2, 32($3)                        IF    RF    ALU   MEM   WB
add $14, $6, $2 (delayed 1 count)                IF    RF    ALU   MEM   WB
sw $15, 80($2) (delayed 1 count)                       IF    RF    ALU   MEM   WB

(IF = Instruc. Fetch, RF = Reg. Fetch, ALU = ALU Process, MEM = Mem. R/W, WB = Reg. or ALU Out Write.)
With the delay, the lw result feeds the ALU input stage of the add instruction, and the register fetch stage of the sw. Note that forwarding is still required (this time from the MEM/WB interface, not the ALU output). However, in addition to forwarding, instructions following a lw must also be delayed for one clock cycle.
Other Problems With Branches

A remaining problem is what to do about instructions following a branch. Even assuming forwarding and stalls, the branch/no branch decision is not made until the third stage. This means that in the MIPS pipeline, two following instructions will enter the pipe before the branch/no branch decision is made. What if:
- The following instructions were for the case of branch taken, and the branch was not taken.
- The following instructions were for branch not taken, and it was taken.

In either case, the wrong instructions are in the pipe and they must be eliminated ("flushed"). How can this problem be prevented? A few approaches to the problem are shown in the following slides.
Control Hazard Approaches (1)

[Figure: MIPS R-2000 pipeline processor stages IF, ID/RF, ALU/EX (Branch), MEM/Bypass, and WB, with an arrow showing the direction of pipeline flow.]
One approach is to always assume the branch is (or is not) taken:

Say we assume the branch is never taken. Then if the instruction in ALU/EX is a branch, the instructions in IF and ID/RF will be those in the "not taken" program line (branch determination is made in ALU/EX). If this assumption is correct, the pipeline will continue to flow without delay. When the branch is taken, instructions in IF and ID/RF must be flushed, usually by changing the op code of those instructions to a nop and letting them proceed to the end of the pipe. Clearly, a 2-clock time delay is involved here, and it would be worse for longer pipelines (P-IV pipeline ~ 20 stages).

[Figure: pipeline stages IF, ID/RF, ALU/EX, MEM/Bypass, and WB, with a Branch Comparator added at the ID/RF stage.]
Reducing the cost of taking the branch:

In this case, a branch assumption is still made (taken or not taken). The difference is that since register contents (and/or immediates) are identified in the ID/RF stage, a comparator can be added there to do the branch/no-branch determination. With the branch determination made in this early stage, only one instruction must be flushed, in the IF stage (only a 1-instruction delay).
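To see what the flush penalty costs, the ideal cycle count s + (n - 1) can be extended with one term per mispredicted branch. The sketch below assumes the branch is always predicted not taken and that each taken branch costs a fixed number of flushed slots (2 when the decision is made in ALU/EX, 1 with the comparator in ID/RF). The instruction and branch counts are made-up illustration values, and the function name is hypothetical.

```python
def cycles_with_flushes(n, s, taken_branches, flush_penalty):
    """Total clock cycles for n instructions on an s-stage pipeline.

    Assumes predict-not-taken, so every taken branch wastes
    `flush_penalty` cycles on instructions that are turned into nops.
    """
    return s + (n - 1) + taken_branches * flush_penalty


if __name__ == "__main__":
    n, s, taken = 1000, 5, 100            # illustrative values only
    in_ex = cycles_with_flushes(n, s, taken, flush_penalty=2)
    in_id = cycles_with_flushes(n, s, taken, flush_penalty=1)
    print("decision in ALU/EX:", in_ex, "cycles")
    print("decision in ID/RF: ", in_id, "cycles")
```

Under these assumed numbers, moving the comparison into ID/RF removes 100 of the 200 wasted cycles.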

[Figure: pipeline stages IF, ID/RF, ALU/EX, MEM/Bypass, and WB, with a Branch History unit providing branch feedback based on history.]
Dynamic branch prediction based on recent branch history:

In this approach, an indicator bit (0/1) gives the last branch condition. The next branch can be made according to the bit setting. This is useful in highly repetitive loops, which may continue for a long time until a substantial number of calculations are complete. Some schemes use 2 bits and do not change the prediction until the predictor is wrong twice, after which the alternate behavior is chosen. In either case, incorrect predictions will still be made, but hopefully not as often.
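The difference between the two schemes shows up clearly on a loop branch. The sketch below models the 1-bit scheme as "predict whatever happened last time" and the 2-bit scheme as a saturating counter, which is one common way to realize "change only after being wrong twice". The function names, the initial predictor states, and the workload (a loop-closing branch taken 99 times and then not taken, repeated for 10 calls) are assumptions chosen for illustration.

```python
def one_bit_mispredictions(outcomes, last=True):
    """Count mispredictions when the prediction is simply the previous outcome."""
    misses = 0
    for taken in outcomes:
        if last != taken:
            misses += 1
        last = taken                      # remember the most recent outcome
    return misses


def two_bit_mispredictions(outcomes, counter=3):
    """Count mispredictions for a 2-bit saturating counter (0..3).

    Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
    Two wrong predictions in a row are needed to flip the prediction.
    """
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


if __name__ == "__main__":
    # Loop branch: taken 99 times, then falls through; the routine is called 10 times.
    outcomes = ([True] * 99 + [False]) * 10
    print("1-bit misses:", one_bit_mispredictions(outcomes))   # 19 of 1000
    print("2-bit misses:", two_bit_mispredictions(outcomes))   # 10 of 1000
```

The 1-bit predictor is wrong twice per call after the first (once on loop exit, once on re-entry), while the 2-bit counter is wrong only on the exit.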
Exercise 2
1. Explain forwarding in your own words.
2. Why doesn't forwarding always work? How can this problem be solved?
3. Why could 2-bit dynamic branch prediction work to ensure about a 1% error rate in branch prediction in a subroutine that loops about 100 times before completion? Hint: Assume that the subroutine is called frequently, and that it always executes 100 or more loop traversals before returning to the calling program.
