This suggests that perhaps we might redesign the CPU slightly so that every CPU section can operate independently on an instruction at the same time.
Lecture #20: The Pipeline MIPS Processor
N. B. Dodge 09/12
Pipeline Architecture
A pipelined computer executes instructions concurrently. Hardware units are organized into stages:
Execution in each stage takes exactly 1 clock period. Stages are separated by pipeline registers that preserve and pass partial results to the next stage.
Unfortunately, as noted earlier, speed = complexity + cost. The pipeline approach brings additional expense plus its own set of problems and complications, called hazards, which we will also study.
[Figure: multicycle execution timeline. Instructions run one after another, each passing through Instruc. Fetch, Reg. Fetch, ALU Process, Mem. R/W, and Reg. or ALU Out Write before the next begins; instruction spans are labeled "4 clock cycles".]
[Figure: pipelined execution timeline starting at clock 0. A new instruction enters Instruc. Fetch each clock cycle while earlier instructions advance through Reg. Fetch, ALU Process, Mem. R/W, and Reg. or ALU Out Write; each instruction spans 5 clock cycles.]
ET is the execution time and s is the number of stages. A multicycle processor needs $ET_s = ns$ clock periods to execute n instructions. A pipelined processor with s stages can execute n instructions in $ET_P = s + (n - 1)$ clock periods.
Thus the speed advantage of pipeline over multicycle can be defined as:
$$S_P = \frac{ET_s}{ET_P} = \frac{ns}{s + (n - 1)}$$
As $n \to \infty$, $S_P \to s$.
The ideal pipeline speedup depends on the number of stages, and can be greater for more stages (hence Intel's choice of a 20-stage pipeline for the current P-IV).
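As a quick numerical check of the speedup formula, the following minimal sketch evaluates $ET_s$, $ET_P$, and $S_P$ for a five-stage pipeline. The function names are illustrative only and do not come from the lecture.

```python
def multicycle_time(n: int, s: int) -> int:
    """Clock periods for n instructions at s clocks each (ET_s = n * s)."""
    return n * s


def pipeline_time(n: int, s: int) -> int:
    """Clock periods for n instructions on an s-stage pipeline (ET_P = s + (n - 1))."""
    return s + (n - 1)


def speedup(n: int, s: int) -> float:
    """Pipeline speedup S_P = ET_s / ET_P."""
    return multicycle_time(n, s) / pipeline_time(n, s)


if __name__ == "__main__":
    # The speedup climbs toward s = 5 as the instruction count grows.
    for n in (1, 5, 100, 1_000_000):
        print(f"n = {n:>9}: S_P = {speedup(n, 5):.3f}")
```

With a single instruction there is no overlap and the speedup is 1; the ideal value s is approached only for long instruction streams.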
Pipeline Stages
[Figure: the five pipeline stages (IF, ID/RF, ALU, MEM, WB) laid out against clock cycles 0 through 5.]
The MIPS R2000 pipeline processor is divided into five processing stages:
1. Instruction fetch (IF)
2. Instruction decode (ID) and register fetch (RF)
3. ALU instruction execution (ALU): ALU processing, branch condition evaluation, memory address computation, etc. This is also referred to as execution (EX).
4. Memory access (MEM)
5. Write back (WB) to register file
[Figure: successive instructions overlapped in the pipeline, each entering IF one clock cycle after the previous one and advancing through ID/RF, ALU, MEM, and WB.]
Single-Cycle Datapath
[Figure: single-cycle MIPS datapath with control signals Reg. Dest., Branch, Mem. Read, Mem. To Reg., ALU Op., Mem. Write, ALU Srce., and Reg. Write. Lines indicate the need for storage between stages if the processor is converted to a pipeline.]
Inter-stage registers are master-slave D flip-flops; the master can be receiving new data from the previous stage of the instruction while the slave flip-flop is providing data to the next stage
[Figure: pipelined datapath with inter-stage registers, including ID/EX, EX/MEM, and MEM/WB.]
After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition
Stage 4: Memory load or store; branch taken/not taken; ALU results bypassed to the MEM/WB register
[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB registers.]
Adding Control
Control information must be carried along as a part of the instruction, since this information is required at different stages of the pipeline. This can be done by adding more inter-stage storage register bits to forward control data yet to be used. The result is very large inter-stage registers. For example, the storage capacity required between the instruction decode and ALU execution stages (ID/EX register) is more than 120 bits. The resulting processor with full control functionality is shown on the next slide.
[Figure: pipelined MIPS datapath with full control. Labels: Control Decode, Reg. Block (Rs, Rt, Rd, Write Data, Read Data 1, Read Data 2), PC, ALU, Memory, Sign Extend, Left shift 2, multiplexers, Memory/ALU Result.]
[Figures: eleven slides step the sequence lw $10, 20($1); sub $11, $2, $3; and $12, $4, $5; or $13, $6, $7; add $14, $8, $9 through the pipelined datapath, one clock period per slide. The stage contents on each slide are summarized below.]

Clock  IF                 ID/RF              EX                 MEM                WB
0      Idle               Idle               Idle               Idle               Idle
1      lw $10, 20($1)     Idle               Idle               Idle               Idle
2      sub $11, $2, $3    lw $10, 20($1)     Idle               Idle               Idle
3      and $12, $4, $5    sub $11, $2, $3    lw $10, 20($1)     Idle               Idle
4      or $13, $6, $7     and $12, $4, $5    sub $11, $2, $3    lw $10, 20($1)     Idle
5      add $14, $8, $9    or $13, $6, $7     and $12, $4, $5    sub $11, $2, $3    lw $10, 20($1)
6      Idle               add $14, $8, $9    or $13, $6, $7     and $12, $4, $5    sub $11, $2, $3
7      Idle               Idle               add $14, $8, $9    or $13, $6, $7     and $12, $4, $5
8      Idle               Idle               Idle               add $14, $8, $9    or $13, $6, $7
9      Idle               Idle               Idle               Idle               add $14, $8, $9
10     Idle               Idle               Idle               Idle               Idle
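The stage-by-stage walkthrough on the preceding slides can be reproduced with a few lines of code. The sketch below assumes an ideal pipeline with no hazards or stalls. The instruction strings are read from the register numbers and ALU operations visible on the slides, and the names `STAGES`, `PROGRAM`, and `occupancy` are illustrative only.

```python
STAGES = ["IF", "ID/RF", "EX", "MEM", "WB"]

# Instruction sequence as read from the walkthrough slides (assumed).
PROGRAM = [
    "lw $10, 20($1)",
    "sub $11, $2, $3",
    "and $12, $4, $5",
    "or $13, $6, $7",
    "add $14, $8, $9",
]


def occupancy(program):
    """Return one row per clock period; row[k] is the instruction in stage k, or 'Idle'."""
    n, s = len(program), len(STAGES)
    rows = []
    for clock in range(1, s + n):      # s + (n - 1) clock periods in total
        row = []
        for k in range(s):
            i = clock - 1 - k          # index of the instruction now in stage k
            row.append(program[i] if 0 <= i < n else "Idle")
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(f"{'Clock':<6}" + "".join(f"{name:<18}" for name in STAGES))
    for clock, row in enumerate(occupancy(PROGRAM), start=1):
        print(f"{clock:<6}" + "".join(f"{cell:<18}" for cell in row))
```

Five instructions on five stages occupy the pipeline for 5 + (5 - 1) = 9 clock periods, and the pipeline is full only in clock 5.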
Exercise 1
On the diagram on the next page, identify the following:
1. Highlight all the control lines that must be active during a load word instruction.
2. As in our exercise in Lecture 20, identify the decoder locations.
3. The ID/EX register interface stores the most bits of any of the pipeline section interfaces. Approximately how many bits is that, according to the diagram?
[Figure: pipelined MIPS datapath with control, to be marked up for Exercise 1.]
Hazards
Hazards occur because data required for executing the current instruction may not be available. An instruction in the register fetch cycle may need data from a register whose value will be changed by an instruction downstream but still in process in the pipeline (in the ALU, memory/memory bypass, or writeback cycle). Thus an upstream instruction could access a register and get incorrect data because the register data has not yet been updated by a downstream instruction.
Hazards (2)
There are two types of hazards: data hazards and control hazards. Both occur because an instruction in the ID/RF stage of the MIPS pipeline needs register data that will shortly be updated by instructions in the EX, MEM/Bypass, or WB stage. Data hazards occur when an instruction needs register contents for an arithmetic, logical, or memory instruction. Control hazards occur when a branch instruction is pending and the data necessary to initiate/bypass the branch is not yet available in the same sort of scenario.
[Figure: pipeline timing diagram for a five-instruction sequence; each instruction spans 5 clock cycles (Instruc. Fetch, Reg. Fetch, ALU Process, Mem. R/W, Reg. or ALU Out Write).]
In the instruction sequence above, the last four instructions require data from $2, which is changed in the first instruction. The $2 data will not be rewritten until cycle 4, so the AND and OR (2nd and 3rd instructions) will fetch incorrect data from $2. Even the add may not get the correct information (sw is okay).
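The dependency check described here can be sketched in code. The slide's instruction listing did not survive, so the sequence below is an assumed stand-in chosen to match the description: the first instruction writes $2, and the next four read it. `Instr`, `raw_hazards`, and the three-instruction window are illustrative choices, not part of the lecture.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str               # assembly text, for display only
    dest: Optional[str]     # register written, or None (e.g. sw)
    srcs: Tuple[str, ...]   # registers read


# Assumed sequence: only the use of $2 is taken from the slide text.
PROGRAM = [
    Instr("sub $2, $1, $3", "$2", ("$1", "$3")),
    Instr("and $12, $2, $5", "$12", ("$2", "$5")),
    Instr("or $13, $6, $2", "$13", ("$6", "$2")),
    Instr("add $14, $2, $2", "$14", ("$2", "$2")),
    Instr("sw $15, 100($2)", None, ("$15", "$2")),
]


def raw_hazards(program, window=3):
    """List (producer, consumer, register) index triples where the consumer
    reads a register within `window` instructions of the one that writes it."""
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None or producer.dest == "$0":
            continue
        last = min(i + window, len(program) - 1)
        for j in range(i + 1, last + 1):
            consumer = program[j]
            if producer.dest in consumer.srcs:
                hazards.append((i, j, producer.dest))
            if consumer.dest == producer.dest:
                break  # register is overwritten; later readers depend on this one
    return hazards


if __name__ == "__main__":
    for i, j, reg in raw_hazards(PROGRAM):
        print(f"{PROGRAM[j].text!r} reads {reg} before {PROGRAM[i].text!r} writes it back")
```

The window of three is deliberately conservative: it flags the add (three instructions behind), which the slide says "may not" get the correct data, and leaves the sw (four behind) unflagged.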
[Figure: pipeline timing diagram for the modified instruction sequence; each instruction spans 5 clock cycles.]
Here the problem is changed, with two branch instructions added. Neither branch instruction may be executed correctly, once again because the new $2 data will not be ready. This wrong data could cause an incorrect branch.
[Figure: two overlapped instructions passing through IF, ID/RF, ALU, MEM, and WB, the second entering IF one clock cycle after the first.]
One solution to the problem of data hazards is forwarding. Forwarding uses the fact that although instruction 2 needs register data two clock cycles before instruction 1 enters the WB stage, that data is already available as the output of the ALU. If a mechanism were available, instruction 1 could forward required register data after its ALU cycle to the ID/RF cycle of instruction 2.
[Figure: forwarding hardware. Labels: Reg. Block, ALU, Memory, multiplexers, Forwarding Unit, MEM/WB Register Rd.]
Forwarding Unit
The forwarding unit samples register IDs in the EX/MEM and MEM/WB registers to determine if source registers in the ID/RF cycle are the same. If so, source register data is replaced by pipeline (as yet unwritten) data by the forwarding unit. The correct information is thus processed and the instruction can proceed to correct execution.
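The comparison performed by the forwarding unit might be sketched as follows. This is a simplified software model, not the hardware on the slide. `Latch` and `forward_operand` are hypothetical names, and two details are assumptions the slide does not state: the EX/MEM result takes priority because it is the more recent one, and register $0 is never forwarded.

```python
from typing import NamedTuple


class Latch(NamedTuple):
    """The part of an inter-stage register the forwarding unit looks at."""
    reg_write: bool   # will this instruction write a register?
    rd: int           # destination register number
    value: int        # result waiting to be written back


def forward_operand(src: int, regfile_value: int, ex_mem: Latch, mem_wb: Latch) -> int:
    """Return the value to use for source register `src`."""
    # Newest result first: the instruction immediately ahead (EX/MEM).
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
        return ex_mem.value
    # Otherwise the instruction two ahead (MEM/WB).
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
        return mem_wb.value
    # No match: the register file already holds the right data.
    return regfile_value


if __name__ == "__main__":
    ex_mem = Latch(reg_write=True, rd=2, value=111)   # newer, unwritten $2
    mem_wb = Latch(reg_write=True, rd=2, value=99)    # older, unwritten $2
    print(forward_operand(2, 7, ex_mem, mem_wb))      # uses the EX/MEM value
    print(forward_operand(5, 7, ex_mem, mem_wb))      # no match, register file value
```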
Stalls
Forwarding will not always solve the problems of data hazards. For example, suppose an add instruction follows a load word (lw), and the add involves the register that receives the memory data. In this case, forwarding will not work. The reason is that the data must be read from memory, and so it will not be available until the end of the MEM cycle. Thus the required data is not available for a forward, and the add instruction, if it proceeds, will process the wrong data. A solution to this problem is the stall. A stall halts the instruction awaiting data, while the key instruction (a lw in this case) proceeds to the end of the MEM cycle, after which the desired data is available to the add.
[Figure: pipeline timing diagram for three instructions, a lw followed by two instructions that depend on it, with no delay inserted.]
Consider the 3 instructions above, the last two depending on the lw. $2 contents will be available at the beginning of the WB stage in the first instruction, but not before. A solution is to let the lw proceed down the pipe, while the add and sw instructions hold place for one cycle.
[Figure: the same three instructions with the add and sw delayed by one clock cycle.]
With the delay, the lw result feeds the ALU input stage of the add instruction, and the fetch stage of the sw. Note that forwarding is still required (this time from the MEM/WB interface, not the ALU output). However, in addition to forwarding, instructions following a lw must also be delayed for one clock cycle.
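The effect of the one-cycle delay on timing can be sketched as below. The model assumes forwarding is present, so only an instruction that immediately follows a lw and reads its destination register is delayed, and it shifts that instruction and everything after it by one clock, as in the diagram. The operands other than $2 are hypothetical, since the slide's instruction listing did not survive, and `schedule_with_stalls` and `total_clocks` are illustrative names.

```python
from typing import NamedTuple, Optional, Tuple

STAGE_COUNT = 5  # IF, ID/RF, EX, MEM, WB


class Instr(NamedTuple):
    text: str
    dest: Optional[str]
    srcs: Tuple[str, ...]
    is_load: bool = False


# Assumed sequence: a lw into $2 followed by two instructions that read $2.
PROGRAM = [
    Instr("lw $2, 20($1)", "$2", ("$1",), is_load=True),
    Instr("add $4, $2, $5", "$4", ("$2", "$5")),
    Instr("sw $2, 24($1)", None, ("$2", "$1")),
]


def schedule_with_stalls(program):
    """Return the clock in which each instruction enters IF (first one = 1)."""
    if_clock = []
    for i, ins in enumerate(program):
        if i == 0:
            if_clock.append(1)
            continue
        prev = program[i - 1]
        # Load-use hazard: the loaded value is not ready in time for a forward.
        bubble = 1 if prev.is_load and prev.dest in ins.srcs else 0
        if_clock.append(if_clock[-1] + 1 + bubble)
    return if_clock


def total_clocks(program):
    """Clock in which the last instruction finishes its WB stage."""
    return schedule_with_stalls(program)[-1] + STAGE_COUNT - 1


if __name__ == "__main__":
    for ins, clock in zip(PROGRAM, schedule_with_stalls(PROGRAM)):
        print(f"clock {clock}: fetch {ins.text}")
    print("total clocks:", total_clocks(PROGRAM))
```

Without the stall the three instructions would finish in 5 + (3 - 1) = 7 clocks; the single bubble raises that to 8.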
In either case, the wrong instructions are in the pipe and they must be eliminated (flushed). How can this problem be prevented? A few approaches to the problem are shown in the following slides.
[Figures: three slides illustrating approaches to branch handling. Surviving labels: Branch, Branch Comparator, ALU/EX, MEM/Bypass, WB.]
Exercise 2
1. Explain forwarding in your own words.
2. Why doesn't forwarding always work? How can this problem be solved?
3. Why could 2-bit dynamic branch prediction work to ensure about a 1% error rate in branch prediction in a subroutine that loops about 100 times before completion? Hint: Assume that the subroutine is called frequently, and that it always executes 100 or more loop traversals before returning to the calling program.