
CPE 408340 Computer Organization Chapter 4: The Processor: Datapath and Control

Saed R. Abed
[Computer Engineering Department, Hashemite University] [Adapted from Otmane Ait Mohamed Slides & Computer Organization and Design, Patterson & Hennessy, 2005, UCB]

Review: Design Principles

Simplicity favors regularity

- fixed size instructions (32 bits)
- only three instruction formats

Good design demands good compromises

- three instruction formats

Smaller is faster

- limited instruction set
- limited number of registers in register file
- limited number of addressing modes

Make the common case fast

- arithmetic operands from the register file (load-store machine)
- allow instructions to contain immediate operands

Review: THE Performance Equation

Our basic performance equation is then

CPU time = Instruction_count x CPI x clock_cycle

or

CPU time = (Instruction_count x CPI) / clock_rate

These equations separate the three key factors that affect performance

- Can measure the CPU execution time by running the program
- The clock rate is usually given in the documentation
- Can measure instruction count by using profilers/simulators without knowing all of the implementation details
- CPI varies by instruction type and ISA implementation, for which we must know the implementation details
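A small sketch of the performance equation in both forms may help. The workload numbers (1 million instructions, CPI of 2, 500 MHz clock) are made up for illustration and do not come from the slides.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds: Instruction_count x CPI / clock_rate."""
    return instruction_count * cpi / clock_rate_hz


def cpu_time_from_cycle(instruction_count, cpi, clock_cycle_s):
    """CPU time in seconds: Instruction_count x CPI x clock_cycle."""
    return instruction_count * cpi * clock_cycle_s


# Hypothetical workload: 1 million instructions, CPI = 2, 500 MHz clock.
t = cpu_time(1_000_000, 2.0, 500e6)
t_alt = cpu_time_from_cycle(1_000_000, 2.0, 1 / 500e6)
print(t, t_alt)  # both forms describe the same quantity (about 4 ms)
```

The two functions differ only in whether the clock is given as a rate or as a cycle time, since clock_cycle = 1 / clock_rate.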

Datapath design tended to just work Control paths are where the system complexity lives. Bugs spawned from control path design errors reside in the microcode flow, the finite-state machines, and all the special exceptions that inevitably spring up in a machine design like thistles in a flower garden.

The Pentium Chronicles, Colwell, pg. 64

4.1 The Processor: Datapath & Control

Our implementation of the MIPS is simplified

- memory-reference instructions: lw, sw
- arithmetic-logical instructions: add, sub, and, or, slt
- control flow instructions: beq, j

Generic implementation

- use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
- decode the instruction (and read registers)
- execute the instruction

[Figure: instruction cycle: Fetch (PC = PC+4), Decode, Exec]

All instructions (except j) use the ALU after reading the registers
How? memory-reference? arithmetic? control flow?

Abstract Implementation View

Two types of functional units:

- elements that operate on data values (combinational)
- elements that contain state (sequential)

[Figure: abstract datapath connecting the PC, Instruction Memory, Register File, ALU, and Data Memory]

Single cycle operation
Split memory (Harvard) model - one memory for instructions and one for data

4.2 Logic Design Conventions: Clocking Methodologies

The clocking methodology defines when signals can be read and when they are written

- An edge-triggered methodology

Typical execution

- read contents of state elements
- send values through combinational logic
- write results to one or more state elements

[Figure: State element 1 -> Combinational logic -> State element 2, with the clock marking one clock cycle]

Assumes state elements are written on every clock cycle; if not, need explicit write control signal

- write occurs only when both the write control is asserted and the clock edge occurs
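The write rule above can be sketched as a tiny model of an edge-triggered state element. The class name `StateElement` and its method `clock_edge` are hypothetical names chosen for this illustration.

```python
class StateElement:
    """Edge-triggered storage: the value changes only on a clock edge,
    and only if the write control is asserted."""

    def __init__(self, value=0):
        self.value = value

    def clock_edge(self, data_in, write_enable=True):
        # A write needs both the clock edge (this call) and the write control.
        if write_enable:
            self.value = data_in
        return self.value


# The PC is written on every edge, so it needs no explicit write control.
pc = StateElement(0)
pc.clock_edge(pc.value + 4)

# A register that is not written every cycle needs the explicit control.
reg = StateElement(7)
reg.clock_edge(99, write_enable=False)
held = reg.value          # still the old value
reg.clock_edge(99, write_enable=True)
```

Reading `value` between edges models the combinational read side: it never changes the stored state.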

4.3 Building a Datapath: Fetching Instructions

Fetching instructions involves

- reading the instruction from the Instruction Memory
- updating the PC value to be the address of the next (sequential) instruction

[Figure: the PC supplies the Read Address of the Instruction Memory, which outputs the Instruction; an adder computes PC + 4; the PC is clocked]

PC is updated every clock cycle, so it does not need an explicit write control signal, just a clock signal
Reading from the Instruction Memory is a combinational activity, so it doesn't need an explicit read control signal

Decoding Instructions

Decoding instructions involves

- sending the fetched instruction's opcode and function field bits to the control unit

and

- reading two values from the Register File
  - Register File addresses are contained in the instruction

[Figure: the Instruction feeds the Control Unit and the Register File (inputs Read Addr 1, Read Addr 2, Write Addr, Write Data; outputs Read Data 1, Read Data 2)]

Reading Registers Just in Case

Note that both RegFile read ports are active for all instructions during the Decode cycle, using the rs and rt instruction field addresses

- Since we haven't decoded the instruction yet, we don't know what the instruction is!
- Just in case the instruction uses values from the RegFile, do work ahead by reading the two source operands

Which instructions do make use of the RegFile values?

Also, all instructions (except j) use the ALU after reading the registers
Why? memory-reference? arithmetic? control flow?

Executing R Format Operations

R format operations (add, sub, slt, and, or)

R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]

- perform operation (op and funct) on values in rs and rt
- store the result back into the Register File (into location rd)

[Figure: Instruction -> Register File -> ALU (overflow and zero outputs), with the RegWrite and ALU control signals]

Note that the Register File is not written every cycle (e.g., sw), so we need an explicit write control signal for the Register File
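As a sketch of how the R-type fields are pulled out of a 32-bit instruction word, the helper below uses shifts and masks at the bit positions listed above. The function name `decode_fields` is a hypothetical name for this illustration; the example word encodes add $t0, $s0, $s1 using the standard MIPS register numbers ($t0 = 8, $s0 = 16, $s1 = 17).

```python
def decode_fields(instr):
    """Split a 32-bit MIPS instruction word into its R-type fields."""
    return {
        "op": (instr >> 26) & 0x3F,     # bits 31-26
        "rs": (instr >> 21) & 0x1F,     # bits 25-21
        "rt": (instr >> 16) & 0x1F,     # bits 20-16
        "rd": (instr >> 11) & 0x1F,     # bits 15-11
        "shamt": (instr >> 6) & 0x1F,   # bits 10-6
        "funct": instr & 0x3F,          # bits 5-0
    }


# add $t0, $s0, $s1: op = 0, rs = 16, rt = 17, rd = 8, shamt = 0, funct = 100000
word = (0 << 26) | (16 << 21) | (17 << 16) | (8 << 11) | (0 << 6) | 0b100000
fields = decode_fields(word)
print(hex(word), fields)
```

In hardware these "shifts" are just wires: each field is a fixed slice of the instruction bus, which is why the register addresses are available before the opcode has been decoded.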


Consider the slt Instruction

Remember the R format instruction slt

slt $t0, $s0, $s1    # if $s0 < $s1
                     # then $t0 = 1
                     # else $t0 = 0

Where does the 1 (or 0) come from to store into $t0 in the Register File at the end of the execute cycle?

[Figure: Instruction -> Register File -> ALU (overflow and zero outputs), with the RegWrite and ALU control signals]

Executing Load and Store Operations

Load and store operations have to

I-Type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]

- compute a memory address by adding the base register (in rs) to the 16-bit signed offset field in the instruction
  - base register was read from the Register File during decode
  - offset value in the low order 16 bits of the instruction must be sign extended to create a 32-bit signed value
- store value, read from the Register File during decode, must be written to the Data Memory
- load value, read from the Data Memory, must be stored in the Register File
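The address computation for lw and sw can be sketched as follows. The function names `sign_extend16` and `effective_address` are hypothetical, and the base and offset values are made up for illustration.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base, offset16):
    """Base register plus sign-extended 16-bit offset, wrapped to 32 bits."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


forward = effective_address(0x1000, 0x0004)   # offset +4
backward = effective_address(0x1000, 0xFFFC)  # offset field 0xFFFC means -4
print(hex(forward), hex(backward))
```

Without the sign extension, the offset field 0xFFFC would be treated as +65532 instead of -4, so negative offsets from the base register would not work.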

Executing Load and Store Operations, cont

[Figure: load/store datapath with the Register File, Sign Extend (16 to 32 bits), ALU (overflow and zero outputs), and Data Memory; control signals RegWrite, ALU control, MemWrite, MemRead]

Executing Branch Operations

Branch operations have to

I-Type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]

- compare the operands read from the Register File during decode (rs and rt values) for equality (zero ALU output)
- compute the branch target address by adding the updated PC to the sign-extended 16-bit signed offset field in the instruction
  - base register is the updated PC
  - offset value in the low order 16 bits of the instruction must be sign extended to create a 32-bit signed value and then shifted left 2 bits to turn the word offset into a byte offset
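A minimal sketch of the branch target computation, assuming the rules above (updated PC plus the sign-extended offset shifted left by 2). The function names are hypothetical; the first example reuses the beq at address 4 with offset 2 that appears later in these slides.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    """Target = (PC + 4) + (sign-extended offset << 2), wrapped to 32 bits."""
    updated_pc = pc + 4
    return (updated_pc + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


skip_two = branch_target(4, 2)           # 4 + 4 + 2*4
self_loop = branch_target(0x100, 0xFFFF)  # offset -1 lands back on the branch
print(skip_two, hex(self_loop))
```

Because the offset counts instructions (words) relative to the updated PC, an offset of -1 branches back to the branch instruction itself.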

Executing Branch Operations, cont

[Figure: branch datapath: one adder computes PC + 4; a second adder adds the sign-extended (16 to 32 bits) offset, shifted left 2, to produce the branch target address; the ALU compares the two Register File outputs and sends zero to the branch control logic]

Executing Jump Operations

Jump operations have to

J-Type: op [31-26] | jump target address [25-0]

- replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits

[Figure: the low 26 bits of the Instruction pass through Shift left 2 to form the 28-bit Jump address; an adder computes PC + 4]
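The jump address formation can be sketched as below. It assumes the upper 4 bits come from PC + 4, as labeled PC+4[31-28] in the later "Adding the Jump Operation" datapath; the function name `jump_target` and the example values are hypothetical.

```python
def jump_target(pc, instr):
    """Keep the upper 4 bits of PC + 4 and replace the lower 28 bits with
    the 26-bit target field shifted left by 2."""
    updated_pc = (pc + 4) & 0xFFFFFFFF
    low28 = (instr & 0x03FFFFFF) << 2   # bits 25-0, then shift left 2
    return (updated_pc & 0xF0000000) | low28


# j with target field 0x100, fetched from address 0x40000000 (opcode 000010)
instr = (0b000010 << 26) | 0x100
target = jump_target(0x40000000, instr)
print(hex(target))
```

The shift by 2 works for the same reason as in branches: instructions are word aligned, so the two low address bits are always zero and need not be stored in the instruction.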

Creating a Single Datapath from the Parts

Assemble the datapath segments and add control lines and multiplexors as needed

Single cycle design - fetch, decode and execute each instruction in one clock cycle

- no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders)
- multiplexors needed at the input of shared elements with control lines to do the selection
- write signals to control writing to the Register File and Data Memory

Cycle time is determined by length of the longest path

Fetch, R, and Memory Access Portions

[Figure: fetch, R-type, and memory access portions combined: PC, Instruction Memory, Add 4, Register File, Sign Extend (16 to 32 bits), ALU (ovf, zero), Data Memory; control signals RegWrite, ALUSrc, ALU control, MemWrite, MemRead, MemtoReg]

Multiplexor Insertion

[Figure: the same datapath with multiplexors inserted; control signals RegWrite, ALUSrc, ALU control, MemWrite, MemRead, MemtoReg]

Clock Distribution
[Figure: the same datapath with the System Clock and one clock cycle marked; control signals RegWrite, ALUSrc, ALU control, MemWrite, MemRead, MemtoReg]

Adding the Branch Portion

[Figure: the datapath with the branch portion added: Shift left 2, a second adder for the branch target, and the PCSrc signal alongside RegWrite, ALUSrc, ALU control, MemWrite, MemRead, MemtoReg]

4.4 A Simple Implementation Scheme: Our Simple Control Structure

We wait for everything to settle down

- ALU might not produce right answer right away
- Memory and RegFile reads are combinational (as are ALU, adders, muxes, shifter, sign extender)
- Use write signals along with the clock edge to determine when to write to the sequential elements (to the PC, to the Register File and to the Data Memory)

The clock cycle time is determined by the logic delay through the longest path

We are ignoring some details like register setup and hold times

Adding the Control

Selecting the operations to perform (ALU, Register File and Memory read/write)
Controlling the flow of data (multiplexor inputs)

R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
I-Type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
J-type: op [31-26] | target address [25-0]

Observations

- op field always in bits 31-26
- addr of registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16); for lw and sw rs is the base register
- addr. of register to be written is in one of two places - in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
- offset for beq, lw, and sw always in bits 15-0

Single Cycle Datapath with Control Unit


[Figure: single cycle datapath with Control Unit. Instr[31-26] feeds the Control Unit; Instr[25-21] and Instr[20-16] address the Register File reads; Instr[15-11] is the alternative write address; Instr[15-0] feeds Sign Extend; Instr[5-0] feeds ALU control. Control signals shown: RegDst, Branch, PCSrc, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite]

ALU Control

ALU's operation based on instruction type and function code

ALU control input | Function
0000              | and
0001              | or
0010              | xor
0011              | nor
0110              | add
1110              | subtract
1111              | set on less than

Notice that we are using different encodings than in the book
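A behavioral sketch of an ALU that uses the control encodings listed above. It returns the 32-bit result and the zero flag only; the overflow output is left out to keep the sketch short, and the names `alu` and `to_signed` are hypothetical.

```python
MASK = 0xFFFFFFFF


def to_signed(x):
    """Reinterpret a 32-bit pattern as a signed two's complement integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (result, zero) for the 4-bit ALU control encodings of these slides."""
    a &= MASK
    b &= MASK
    if control == 0b0000:
        result = a & b
    elif control == 0b0001:
        result = a | b
    elif control == 0b0010:
        result = a ^ b
    elif control == 0b0011:
        result = ~(a | b) & MASK                       # nor
    elif control == 0b0110:
        result = (a + b) & MASK                        # add
    elif control == 0b1110:
        result = (a - b) & MASK                        # subtract
    elif control == 0b1111:
        result = 1 if to_signed(a) < to_signed(b) else 0   # set on less than
    else:
        raise ValueError("unused ALU control encoding")
    return result, result == 0


print(alu(0b0110, 2, 3), alu(0b1110, 7, 7), alu(0b1111, 0xFFFFFFFF, 1))
```

The subtract case with equal operands shows how beq uses the ALU: the result is 0, so the zero output is asserted.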

ALU Control, Cont

Controlling the ALU uses multiple decoding levels

- main control unit generates the ALUOp bits
- ALU control unit generates ALUcontrol bits

Instr op | funct  | ALUOp | action   | ALUcontrol
lw       | xxxxxx | 00    | add      | 0110
sw       | xxxxxx | 00    | add      | 0110
beq      | xxxxxx | 01    | subtract | 1110
add      | 100000 | 10    | add      | 0110
subt     | 100010 | 10    | subtract | 1110
and      | 100100 | 10    | and      | 0000
or       | 100101 | 10    | or       | 0001
xor      | 100110 | 10    | xor      | 0010
nor      | 100111 | 10    | nor      | 0011
slt      | 101010 | 10    | slt      | 1111
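The second decoding level can be sketched as a lookup that follows the table above: ALUOp 00 and 01 fix the operation, while ALUOp 10 defers to the funct field. The names `alu_control` and `FUNCT_TO_ALUCONTROL` are hypothetical.

```python
# funct field -> ALUcontrol bits, used only when ALUOp = 10 (R-type)
FUNCT_TO_ALUCONTROL = {
    0b100000: 0b0110,  # add
    0b100010: 0b1110,  # subtract
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b100110: 0b0010,  # xor
    0b100111: 0b0011,  # nor
    0b101010: 0b1111,  # slt
}


def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp and the 6-bit funct field to the ALUcontrol bits."""
    if alu_op == 0b00:      # lw, sw: add for the address computation
        return 0b0110
    if alu_op == 0b01:      # beq: subtract for the equality compare
        return 0b1110
    if alu_op == 0b10:      # R-type: operation chosen by funct
        return FUNCT_TO_ALUCONTROL[funct]
    raise ValueError("unused ALUOp encoding")


print(format(alu_control(0b10, 0b101010), "04b"))  # slt
```

Splitting the decode this way keeps the main control unit small: it only needs to tell three cases apart, and the funct field is examined locally.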

ALU Control Truth Table


F5 F4 F3 F2 F1 F0 | ALUOp1 ALUOp0 | ALUcontrol3 ALUcontrol2 ALUcontrol1 ALUcontrol0
X  X  X  X  X  X  | 0      0      | 0           1           1           0
X  X  X  X  X  X  | 0      1      | 1           1           1           0
X  X  0  0  0  0  | 1      0      | 0           1           1           0
X  X  0  0  1  0  | 1      0      | 1           1           1           0
X  X  0  1  0  0  | 1      0      | 0           0           0           0
X  X  0  1  0  1  | 1      0      | 0           0           0           1
X  X  0  1  1  0  | 1      0      | 0           0           1           0
X  X  0  1  1  1  | 1      0      | 0           0           1           1
X  X  1  0  1  0  | 1      0      | 1           1           1           1

The outputs are our ALU control input (figure labels on the output columns: Add/subt, Mux control)

Four, 6-input truth tables

ALU Control Logic

From the truth table can design the ALU Control logic

[Figure: gate-level ALU control logic with inputs Instr[3], Instr[2], Instr[1], Instr[0], ALUOp1, ALUOp0 and outputs ALUcontrol3, ALUcontrol2, ALUcontrol1, ALUcontrol0]

R-type Instruction Data/Control Flow


[Figure: single cycle datapath with Control Unit, showing the data and control flow for an R-type instruction]

Store Word Instruction Data/Control Flow


[Figure: single cycle datapath with Control Unit, showing the data and control flow for a store word instruction]

Load Word Instruction Data/Control Flow


[Figure: single cycle datapath with Control Unit, showing the data and control flow for a load word instruction]

Branch Instruction Data/Control Flow


[Figure: single cycle datapath with Control Unit, showing the data and control flow for a branch instruction]

Main Control Unit


Instr         | RegDst | ALUSrc | MemReg | RegWr | MemRd | MemWr | Branch | ALUOp
R-type 000000 | 1      | 0      | 0      | 1     | 0     | 0     | 0      | 10
lw     100011 | 0      | 1      | 1      | 1     | 1     | 0     | 0      | 00
sw     101011 | X      | 1      | X      | 0     | 0     | 1     | 0      | 00
beq    000100 | X      | 0      | X      | 0     | 0     | 0     | 1      | 01

Setting of the MemRd signal (for R-type, sw, beq) depends on the memory design (could have to be 0 or could be an X (don't care))
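The control table above can be sketched as a lookup keyed by opcode. Don't-care entries (X) are stored as `None`, which is a representation chosen for this sketch; the names `CONTROL` and `main_control` are hypothetical.

```python
# opcode -> control signal settings; None marks a don't care (X)
CONTROL = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),    # R-type
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),    # lw
    0b101011: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),    # sw
    0b000100: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),    # beq
}


def main_control(opcode):
    """Return the control signal settings for a 6-bit opcode."""
    return CONTROL[opcode]


lw_signals = main_control(0b100011)
print(lw_signals)
```

Only lw asserts both RegWrite and MemtoReg, and only sw asserts MemWrite, which matches the observation that the Register File and Data Memory need explicit write controls.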

Control Unit Logic

From the truth table can design the Main Control logic

[Figure: gate-level main control logic: inputs Instr[31] through Instr[26] are decoded into R-type, lw, sw, and beq lines, which drive the outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]

Review: Handling Jump Operations

Jump operations have to

- replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits

J-Type: op [31-26] | jump target address [25-0]

[Figure: the low 26 bits of the Instruction pass through Shift left 2 to form the 28-bit Jump address; an adder computes PC + 4]

Adding the Jump Operation


[Figure: single cycle datapath with Control Unit and the jump operation added: Instr[25-0] (26 bits) passes through Shift left 2 to give 28 bits, which are combined with PC+4[31-28] into a 32-bit jump address; a Jump control signal selects it as the next PC]

Main Control Unit


Instr         | RegDst | ALUSrc | MemReg | RegWr | MemRd | MemWr | Branch | ALUOp | Jump
R-type 000000 | 1      | 0      | 0      | 1     | 0     | 0     | 0      | 10    | 0
lw     100011 | 0      | 1      | 1      | 1     | 1     | 0     | 0      | 00    | 0
sw     101011 | X      | 1      | X      | 0     | 0     | 1     | 0      | 00    | 0
beq    000100 | X      | 0      | X      | 0     | 0     | 0     | 1      | 01    | 0
j      000010 | X      | X      | X      | 0     | 0     | 0     | X      | XX    | 1

Setting of the MemRd signal (for R-type, sw, beq) depends on the memory design

Single Cycle Implementation Cycle Time

Unfortunately, though simple, the single cycle approach is not used because it is very slow
Clock cycle must have the same length for every instruction

What is the longest (slowest) path (slowest instruction)?

Instruction Critical Paths


Calculate cycle time assuming negligible delays (for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times) except:

- Instruction and Data Memory (4 ns)
- ALU and adders (2 ns)
- Register File access (reads or writes) (1 ns)

Instr.  | I Mem | Reg Rd | ALU Op | D Mem | Reg Wr | Total
R-type  | 4     | 1      | 2      |       | 1      | 8
load    | 4     | 1      | 2      | 4     | 1      | 12
store   | 4     | 1      | 2      | 4     |        | 11
beq     | 4     | 1      | 2      |       |        | 7
jump    | 4     |        |        |       |        | 4
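The totals in the table can be reproduced with a short calculation. The delays are the ones stated above; the dictionary names are hypothetical.

```python
# Unit delays in ns, as given on the slide
DELAY_NS = {"I Mem": 4, "Reg Rd": 1, "ALU Op": 2, "D Mem": 4, "Reg Wr": 1}

# Functional units on each instruction's critical path
UNITS_USED = {
    "R-type": ["I Mem", "Reg Rd", "ALU Op", "Reg Wr"],
    "load":   ["I Mem", "Reg Rd", "ALU Op", "D Mem", "Reg Wr"],
    "store":  ["I Mem", "Reg Rd", "ALU Op", "D Mem"],
    "beq":    ["I Mem", "Reg Rd", "ALU Op"],
    "jump":   ["I Mem"],
}

totals = {instr: sum(DELAY_NS[u] for u in units)
          for instr, units in UNITS_USED.items()}

# A single cycle design must fit the slowest instruction.
cycle_time_ns = max(totals.values())
print(totals, cycle_time_ns)
```

The load sets the cycle time at 12 ns, so a jump that needs only 4 ns still occupies a full 12 ns cycle.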

Single Cycle Disadvantages & Advantages

Uses the clock cycle inefficiently - the clock cycle must be timed to accommodate the slowest instruction

- especially problematic for more complex instructions like floating point multiply

[Figure: timing diagram: lw fills Cycle 1; sw finishes early in Cycle 2 and the rest of the cycle is marked Waste]

May be wasteful of area since some functional units (e.g., adders) must be duplicated since they cannot be shared during a clock cycle

but

Is simple and easy to understand

4.5 Pipelining is Natural!


Laundry Example

Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold

- Washer takes 30 minutes
- Dryer takes 30 minutes
- Folder takes 30 minutes
- Stasher takes 30 minutes to put clothes into drawers

Sequential Laundry
[Figure: timeline from 6 PM to 2 AM; loads A, B, C, D in task order run one after another, each taking four 30-minute steps]

Sequential laundry takes 8 hours for 4 loads
If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP


[Figure: timeline starting at 6 PM; loads A, B, C, D in task order overlap, each starting 30 minutes after the previous one, for seven 30-minute slots in total]

Pipelined laundry takes 3.5 hours for 4 loads!
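The two laundry totals follow from a simple count of 30-minute slots. A sketch, with hypothetical function names:

```python
def sequential_minutes(loads, stages, stage_minutes):
    """Each load finishes all of its stages before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads, stages, stage_minutes):
    """The first load needs all stages; each later load adds one more slot."""
    return (stages + loads - 1) * stage_minutes


seq = sequential_minutes(4, 4, 30)   # wash, dry, fold, stash for 4 loads
pipe = pipelined_minutes(4, 4, 30)
print(seq / 60, pipe / 60)           # hours
```

With 4 loads and 4 stages the pipelined schedule needs 7 slots instead of 16, which is the 3.5 hours versus 8 hours on the slides.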

Pipelining Lessons

- Pipelining doesn't help latency of single task, it helps throughput of entire workload
- Multiple tasks operating simultaneously using different resources
- Potential speedup = Number pipe stages
- Pipeline rate limited by slowest pipeline stage
- Unbalanced lengths of pipe stages reduces speedup
- Time to fill pipeline and time to drain it reduces speedup
- Stall for Dependences

[Figure: pipelined laundry timeline for loads A, B, C, D in task order]

The Five Stages of Load


[Figure: Load occupies Cycle 1 through Cycle 5: Ifetch, Reg/Dec, Exec, Mem, Wr]

- Ifetch: Instruction Fetch - fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Read the data from the Data Memory
- Wr: Write the data back to the register file

Pipelining

Improve performance by increasing instruction throughput


[Figure: program execution order (in instructions) against time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each passing through Instruction fetch, Reg, ALU, Data access, Reg. Top: without pipelining, a new instruction starts every 800 ps. Bottom: with pipelining, a new instruction starts every 200 ps]

Note: timing assumptions changed for this example

Ideal speedup is number of stages in the pipeline. Do we achieve this?
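A sketch of the timing in this example, using the 800 ps and 200 ps figures and the five stages shown in the diagram. The function names are hypothetical, and the pipelined model assumes no stalls.

```python
def unpipelined_ps(n, instr_ps=800):
    """Each instruction runs to completion before the next one starts."""
    return n * instr_ps


def pipelined_ps(n, stages=5, stage_ps=200):
    """Fill the pipeline once, then finish one instruction per stage time."""
    return (stages + n - 1) * stage_ps


three_loads = (unpipelined_ps(3), pipelined_ps(3))
speedup_large = unpipelined_ps(1_000_000) / pipelined_ps(1_000_000)
print(three_loads, speedup_large)
```

Under these assumptions the speedup for a long instruction stream approaches 800 / 200 = 4 rather than the ideal 5, because every stage is given the time of the slowest stage, and for only three instructions the fill time keeps the speedup lower still.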

Basic Idea
- IF: Instruction fetch
- ID: Instruction decode/register file read
- EX: Execute/address calculation
- MEM: Memory access
- WB: Write back

[Figure: the single cycle datapath (PC, Instruction memory, Registers, Sign extend, Shift left 2, adders, ALU, Data Memory) divided into the five stages]

What do we need to add to actually split the datapath into stages?

A Pipelined MIPS Processor

Start the next instruction before the current one has completed

- improves throughput - total amount of work done in a given time
- instruction latency (execution time, delay time, response time - time from the start of an instruction to its completion) is not reduced

[Figure: lw, sw, and R-type on a Cycle 1 to Cycle 8 axis; each passes through IFetch, Dec, Exec, Mem, WB, starting one cycle after the previous instruction]

- clock cycle (pipeline stage time) is limited by the slowest stage
- for some instructions, some stages are wasted cycles

Single Cycle, Multiple Cycle, vs. Pipeline


[Figure: timing comparison for lw, sw, and R-type. Single Cycle Implementation: lw fills Cycle 1; sw finishes early in Cycle 2, leaving Waste. Multiple Cycle Implementation (Cycle 1 to Cycle 10): lw takes IFetch, Dec, Exec, Mem, WB; sw takes IFetch, Dec, Exec, Mem; then R-type starts with IFetch. Pipeline Implementation: lw, sw, and R-type overlap, each starting one cycle after the previous]

pipeline clock same as multicycle clock

4.6 MIPS Pipeline Datapath Modifications

What do we need to add/modify in our MIPS datapath?

- State registers between each pipeline stage to isolate them

[Figure: pipelined datapath with stages IF:IFetch, ID:Dec, EX:Execute, MEM:MemAccess, WB:WriteBack, separated by the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB state registers and driven by the System Clock]

MIPS Pipeline Control Path Modifications

All control signals can be determined during Decode

- and held in the state registers between pipeline stages

[Figure: pipelined datapath with the Control unit in the decode stage and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]

Single Data Path to be pipelined

Pipelined version of the single cycle datapath

Instructions being executed assuming pipelined execution

IF and ID

EX (lw instruction)

MEM and WB

IF and ID (SW)

EX (SW)

MEM and WB

Correct Data Path to handle lw correctly

Example

Example

Pipelining the MIPS ISA

What makes it easy

- all instructions are the same length (32 bits)
  - can fetch in the 1st stage and decode in the 2nd stage
- few instruction formats (three) with symmetry across formats
  - can begin reading register file in 2nd stage
- memory operations can occur only in loads and stores
  - can use the execute stage to calculate memory addresses
- each MIPS instruction writes at most one result (i.e., changes the machine state) and does so near the end of the pipeline (MEM and WB)

What makes it hard

- structural hazards: what if we had only one memory?
- control hazards: what about branches?
- data hazards: what if an instruction's input operands depend on the output of a previous instruction?

Graphically Representing MIPS Pipeline

[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg]

Can help with answering questions like:

- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
- Is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Performance!


[Figure: Inst 0 through Inst 4 in instruction order against time (clock cycles); each passes through IM, Reg, ALU, DM, Reg, starting one cycle after the previous; the initial cycles are marked as the time to fill the pipeline]

Once the pipeline is full, one instruction is completed every cycle, so CPI = 1

Can Pipelining Get Us Into Trouble?

Yes: Pipeline Hazards

- structural hazards: attempt to use the same resource by two different instructions at the same time
- data hazards: attempt to use data before it is ready
  - An instruction's source operand(s) are produced by a prior instruction still in the pipeline
- control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
  - branch and jump instructions, exceptions

Can always resolve hazards by waiting

- pipeline control must detect the hazard and take action to resolve hazards

A Single Memory Would Be a Structural Hazard


[Figure: lw followed by Inst 1 to Inst 4 against time (clock cycles), each using Mem, Reg, ALU, Mem, Reg with a single memory; in one cycle lw is reading data from memory while a later instruction is reading its instruction from memory]

Can fix with separate instr and data memories

How About Register File Access?


[Figure: add $1, followed by Inst 1, Inst 2, and add $2,$1, against time (clock cycles), each passing through IM, Reg, ALU, DM, Reg; one clock edge controls register writing and the other controls loading of the pipeline state registers]

Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half

Register Usage Can Cause Data Hazards

Dependencies backward in time cause hazards

[Figure: add $1, followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5, each passing through IM, Reg, ALU, DM, Reg]

Read before write data hazard

Loads Can Cause Data Hazards

Dependencies backward in time cause hazards

[Figure: lw $1,4($2) followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5, each passing through IM, Reg, ALU, DM, Reg]

Load-use data hazard

4.7 MIPS Pipeline Data and Control Paths

[Figure: pipelined datapath with control and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; control signals shown: PCSrc, Branch, RegWrite, ALUSrc, ALU cntrl, ALUOp, RegDst, MemWrite, MemRead, MemtoReg]

Control Settings

      | EX Stage                        | MEM Stage                 | WB Stage
      | RegDst  ALUOp1  ALUOp0  ALUSrc  | Brch  MemRead  MemWrite   | RegWrite  MemtoReg
R     | 1       1       0       0       | 0     0        0          | 1         0
lw    | 0       0       0       1       | 0     1        0          | 1         1
sw    | X       0       0       1       | 0     0        1          | 0         X
beq   | X       0       1       0       | 1     0        0          | 0         X

CSE431 L05 Basic MIPS Architecture.93

Irwin, PSU, 2005

93

One Way to Fix a Data Hazard


Can fix data hazard by waiting - stall

[Figure: add $1, followed by two stall cycles, then sub $4,$1,$5 and and $6,$1,$7, each passing through IM, Reg, ALU, DM, Reg]

Another Way to Fix a Data Hazard


Fix data hazards by forwarding results as soon as they are available to where they are needed

[Figure: add $1, followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5, each passing through IM, Reg, ALU, DM, Reg, with forwarding paths from the add]

Data Forwarding (aka Bypassing)

Take the result from the earliest point that it exists in any of the pipeline state registers and forward it to the functional units (e.g., the ALU) that need it that cycle

For ALU functional unit: the inputs can come from any pipeline register rather than just from ID/EX by

- adding multiplexors to the inputs of the ALU
- connecting the Rd write data in EX/MEM or MEM/WB to either (or both) of the EX stage's Rs and Rt ALU mux inputs
- adding the proper control hardware to control the new muxes

Other functional units may need similar forwarding logic (e.g., the DM)

With forwarding can achieve a CPI of 1 even in the presence of data dependencies

Data Forwarding Control Conditions


1. EX/MEM hazard:

if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 10
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 10

Forwards the result from the previous instr. to either input of the ALU

2. MEM/WB hazard:

if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 01

Forwards the result from the second previous instr. to either input of the ALU

Forwarding Illustration

[Figure: add $1, followed by sub $4,$1,$5 (EX/MEM hazard forwarding) and and $6,$7,$1 (MEM/WB hazard forwarding), each passing through IM, Reg, ALU, DM, Reg]

Yet Another Complication!

Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction - which should be forwarded?

[Figure: add $1,$1,$2; add $1,$1,$3; add $1,$1,$4, each passing through IM, Reg, ALU, DM, Reg]

Corrected Data Forwarding Control Conditions


2. MEM/WB hazard:

if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (EX/MEM.RegisterRd != ID/EX.RegisterRt)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 01
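The forwarding conditions can be sketched as a small function that follows the EX/MEM rule and the corrected MEM/WB rule as written on the slides. The function names are hypothetical, and the value 00 for "no forwarding, use the ID/EX value" is an assumption of this sketch since the slides only list the 10 and 01 cases.

```python
def forward_select(id_ex_reg, ex_mem_regwrite, ex_mem_rd,
                   mem_wb_regwrite, mem_wb_rd):
    """Mux select for one ALU input whose source register is id_ex_reg."""
    # 1. EX/MEM hazard: result of the previous instruction
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return 0b10
    # 2. MEM/WB hazard (corrected): only if EX/MEM does not also match
    if (mem_wb_regwrite and mem_wb_rd != 0
            and ex_mem_rd != id_ex_reg and mem_wb_rd == id_ex_reg):
        return 0b01
    return 0b00  # assumed encoding for "no forwarding"


def forwarding_unit(id_ex_rs, id_ex_rt, ex_mem_regwrite, ex_mem_rd,
                    mem_wb_regwrite, mem_wb_rd):
    """Return (ForwardA, ForwardB) for the Rs and Rt ALU inputs."""
    args = (ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd)
    return forward_select(id_ex_rs, *args), forward_select(id_ex_rt, *args)


# Third add of: add $1,$1,$2; add $1,$1,$3; add $1,$1,$4
# Both older instructions write $1, so the newer EX/MEM result must win.
conflict = forwarding_unit(1, 4, True, 1, True, 1)

# and $6,$7,$1 with sub $4,... in EX/MEM and add $1,... in MEM/WB
illustration = forwarding_unit(7, 1, True, 4, True, 1)
print(conflict, illustration)
```

Checking the EX/MEM rule first gives the same priority as the extra term in the corrected MEM/WB condition: when both stages hold a value for the same register, the more recent one is forwarded.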

Datapath with Forwarding Hardware


[Figure: pipelined datapath with a Forward Unit; its inputs are ID/EX.RegisterRs, ID/EX.RegisterRt, EX/MEM.RegisterRd, and MEM/WB.RegisterRd]

Memory-to-Memory Copies

For loads immediately followed by stores (memory-to-memory copies) can avoid a stall by adding forwarding hardware from the MEM/WB register to the data memory input.

- Would need to add a Forward Unit and a mux to the memory access stage

[Figure: lw $1,4($2) followed by sw $1,4($3), each passing through IM, Reg, ALU, DM, Reg]

Forwarding with Load-use Data Hazards


[Figure: lw $1,4($2) followed by one stall cycle, then sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5, each passing through IM, Reg, ALU, DM, Reg]

Adding the Hazard Hardware


[Figure: pipelined datapath with a Hazard Unit added next to the Control unit; the Hazard Unit takes ID/EX.MemRead and ID/EX.RegisterRt as inputs, and a mux can replace the control signals with 0; the Forward Unit is retained]
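The slides show the Hazard Unit's inputs but not its condition. The sketch below uses the usual textbook load-use test (a load in EX whose destination rt matches a source register of the instruction in ID); that condition, the function name, and the IF/ID register field arguments are assumptions of this sketch rather than something stated on the slides.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID must wait one cycle for a load in EX.

    Assumed condition: the instruction in EX is a load (MemRead asserted)
    and its destination register rt is a source of the instruction in ID.
    """
    return bool(id_ex_mem_read
                and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))


# lw $1,4($2) in EX, sub $4,$1,$5 in ID: rs = 1 matches the load's rt = 1
stall_needed = load_use_stall(True, 1, 1, 5)

# add $1,... in EX is not a load, so forwarding alone is enough
no_stall = load_use_stall(False, 1, 1, 5)
print(stall_needed, no_stall)
```

A stall is needed only for loads because the loaded value does not exist until the end of MEM, one cycle too late to forward to an instruction that is immediately behind it.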

4.8 Control Hazards

When the flow of instruction addresses is not sequential (i.e., the next PC is not PC + 4)

- Conditional branches (beq, bne)
- Unconditional branches (j, jal, jr)
- Exceptions

Possible solutions

- Stall (impacts performance)
- Move branch decision point as early in the pipeline as possible, thereby reducing the number of stall cycles
- Delay decision (requires compiler support)
- Predict and hope for the best!

Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards

Datapath Branch and Jump Hardware


[Figure: pipelined datapath with branch and jump hardware: Jump and PCSrc select the next PC; the jump address is formed from the shifted target field and PC+4[31-28]; the Forward Unit is retained]

Jumps Incur One Stall


Jumps

not decoded until ID, so one flush is needed

To flush, set IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a noop)
[Pipeline diagram, in instruction order: j, flush, j target; the jump hazard is fixed by waiting (flush)]

Fortunately, jumps are very infrequent: only 3% of the SPECint instruction mix
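The cost of the one-cycle jump flush can be estimated with a short calculation. This is a sketch that assumes an ideal base CPI of 1 and no other stall sources; the function name is illustrative.

```python
def effective_cpi(base_cpi, stall_sources):
    """Base CPI plus the sum of (frequency * penalty in cycles) per stall source."""
    return base_cpi + sum(freq * penalty for freq, penalty in stall_sources)


# Jumps: 3% of the instruction mix, one flushed cycle each.
cpi = effective_cpi(1.0, [(0.03, 1)])
print(round(cpi, 2))
```

Under these assumptions the jump flush raises CPI by only about 0.03, which is why waiting is an acceptable fix for jumps.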

Supporting ID Stage Jumps


[Datapath diagram: the pipelined datapath with jumps resolved in the ID stage. Labels: Jump, PCSrc, Branch, Control, two Shift left 2 units, PC+4[31-28], Add, Sign Extend, Forward Unit; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]

Branches Cause Control Hazards

Dependencies backward in time cause hazards


[Pipeline diagram, in instruction order: beq, lw, Inst 3, Inst 4; each instruction passes through IM, Reg, ALU, DM, Reg]

One Way to Fix a Branch Control Hazard


[Pipeline diagram, in instruction order: beq, flush, flush, flush, beq target, Inst 3]

Fix branch hazard by waiting (flush), but this affects CPI

Another Way to Fix a Branch Control Hazard

Move the branch decision hardware back to as early in the pipeline as possible, i.e., during the decode cycle

[Pipeline diagram, in instruction order: beq, flush, beq target, Inst 3]

Fix branch hazard by waiting (flush)

Yet Another Way to Fix a Control Hazard


Predict that branches are always not taken, and take corrective action when wrong (i.e., taken)

[Pipeline diagram, in instruction order: 4 beq $1,$2,2; 8 sub $4,$1,$5 (flushed); 16 and $6,$1,$7; 20 or $8,$1,$9]

Branch decision hardware moved to the decode cycle

To flush, set IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a noop)
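The fetch sequence under predict-not-taken can be traced with a small sketch. It assumes the branch is resolved in the decode cycle, so exactly one fall-through instruction has been fetched when a taken branch is detected, and it uses the standard beq target computation (PC + 4 plus the offset shifted left by 2). Function names are illustrative.

```python
def branch_target(pc, offset):
    """beq target address: (PC + 4) + (offset << 2)."""
    return pc + 4 + (offset << 2)


def fetch_trace(branch_pc, offset, taken, count):
    """Return `count` pairs (fetched address, flushed?) starting at the branch."""
    trace = [(branch_pc, False)]
    pc = branch_pc + 4
    if taken:
        # The fall-through instruction was already fetched; it is flushed
        # (turned into a noop) and fetching restarts at the target.
        trace.append((pc, True))
        pc = branch_target(branch_pc, offset)
    while len(trace) < count:
        trace.append((pc, False))
        pc += 4
    return trace[:count]


# beq $1,$2,2 at address 4, taken: fetch 4, 8 (flushed), 16, 20.
for address, flushed in fetch_trace(4, 2, True, 4):
    print(address, "flushed" if flushed else "")
```

With the branch at address 4 and an offset of 2, the target is 16, so the instruction at address 8 is the only one squashed.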

Two Types of Stalls

Noop instruction (or bubble) inserted between two instructions in the pipeline (e.g., load-use hazards)

Keep the instructions earlier in the pipeline (later in the code) from progressing down the pipeline for a cycle (bounce them in place with write control signals)
Insert a noop instruction by zeroing control bits in the pipeline register at the appropriate stage
Let the instructions later in the pipeline (earlier in the code) progress normally down the pipeline

Flushes (or instruction squashing) where an instruction in the pipeline is replaced with a noop instruction (as done for instructions located sequentially after j and beq instructions)

Zero the control bits for the instruction to be flushed
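The difference between the two kinds of stall can be shown on a toy model of the pipeline state. This is only a sketch: the state is a plain dictionary, and the signal names PCWrite and IF/ID.Write are assumed names for the write controls that hold earlier instructions in place.

```python
NOOP_CONTROL = {"RegWrite": 0, "MemRead": 0, "MemWrite": 0}


def stall(state):
    """Bubble: hold the PC and IF/ID in place, zero the control bits in ID/EX."""
    new_state = dict(state)
    new_state["PCWrite"] = 0          # PC keeps its value this cycle
    new_state["IF/ID.Write"] = 0      # IF/ID keeps its instruction this cycle
    new_state["ID/EX.control"] = dict(NOOP_CONTROL)
    return new_state


def flush(state):
    """Squash: zero the instruction field of IF/ID, turning it into a noop."""
    new_state = dict(state)
    new_state["IF/ID.instruction"] = 0
    return new_state


state = {
    "PCWrite": 1,
    "IF/ID.Write": 1,
    "IF/ID.instruction": 0x12345678,  # arbitrary non-zero instruction word
    "ID/EX.control": {"RegWrite": 1, "MemRead": 0, "MemWrite": 0},
}
print(stall(state)["ID/EX.control"])
print(flush(state)["IF/ID.instruction"])
```

A stall keeps the fetched instruction so it can retry next cycle, while a flush discards it.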


Many Other Pipeline Structures Are Possible

What about the (slow) multiply operation?

Make the clock twice as slow or let it take two cycles (since it doesn't use the DM stage)
[Pipeline diagram; labels: IM, Reg, ALU, MUL, DM, Reg]

What if the data memory access is twice as slow as the instruction memory?

Make the clock twice as slow or let data memory access take two cycles (and keep the same clock rate)
[Pipeline diagram: IM, Reg, ALU, DM1, DM2, Reg]

Designing a Pipelined Processor

Go back and examine your datapath and control diagram
Associate resources with states
Ensure that flows do not conflict, or figure out how to resolve the conflicts
Assert control in the appropriate stage

Pipelining the Load Instruction


        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw  Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw           Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                    Ifetch   Reg/Dec  Exec     Mem      Wr

The five independent functional units in the pipeline datapath are:

Instruction Memory for the Ifetch stage
Register File's Read ports (bus A and bus B) for the Reg/Dec stage
ALU for the Exec stage
Data Memory for the Mem stage
Register File's Write port (bus W) for the Wr stage
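The timing chart for back-to-back loads can be generated mechanically. This sketch assumes one instruction is issued per cycle and that every instruction uses all five stages in order; the function names are illustrative.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]


def timing_chart(n_instructions, stages=STAGES):
    """Row i holds the stage instruction i occupies in each cycle (None if idle)."""
    total_cycles = len(stages) + n_instructions - 1
    chart = []
    for i in range(n_instructions):
        row = [None] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters the pipeline in cycle i + 1
        chart.append(row)
    return chart


def units_in_use(chart, cycle_index):
    """Stages (and so functional units) busy in the given 0-based cycle."""
    return [row[cycle_index] for row in chart if row[cycle_index] is not None]


chart = timing_chart(3)
for row in chart:
    print(" ".join(f"{stage or '':8}" for stage in row))
```

Because each stage has its own functional unit, no unit appears twice in any column of the chart.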

The Four Stages of R-type


        Cycle 1  Cycle 2  Cycle 3  Cycle 4
R-type  Ifetch   Reg/Dec  Exec     Wr

Ifetch: Instruction Fetch
Fetch the instruction from the Instruction Memory

Reg/Dec: Registers Fetch and Instruction Decode

Exec:
ALU operates on the two register operands
Update PC

Wr: Write the ALU output back to the register file

Pipelining the R-type and Load Instruction


        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type  Ifetch   Reg/Dec  Exec     Wr
R-type           Ifetch   Reg/Dec  Exec     Wr
Load                      Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                             Ifetch   Reg/Dec  Exec     Wr
R-type                                      Ifetch   Reg/Dec  Exec     Wr

Oops! We have a problem!

We have a pipeline conflict, or structural hazard:

Two instructions try to write to the register file at the same time!
Only one write port

Important Observation

Each functional unit can only be used once per instruction
Each functional unit must be used at the same stage for all instructions:

Load uses Register File's Write Port during its 5th stage
        1       2        3     4    5
Load    Ifetch  Reg/Dec  Exec  Mem  Wr

R-type uses Register File's Write Port during its 4th stage
        1       2        3     4
R-type  Ifetch  Reg/Dec  Exec  Wr

2 ways to solve this pipeline hazard.

Solution 1: Insert Bubble into the Pipeline


[Pipeline timing diagram, Cycles 1 to 9: a Load followed by three R-type instructions, with a pipeline bubble inserted so that two instructions do not reach Wr in the same cycle]

Insert a bubble into the pipeline to prevent 2 writes in the same cycle

The control logic can be complex.
Lose instruction fetch and issue opportunity.

No instruction is started in Cycle 6!

Solution 2: Delay R-type's Write by One Cycle

Delay R-type's register write by one cycle:

Now R-type instructions also use Reg File's write port at Stage 5
Mem stage is a NOOP stage: nothing is being done.

        1       2        3     4    5
R-type  Ifetch  Reg/Dec  Exec  Mem  Wr

        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type  Ifetch   Reg/Dec  Exec     Mem      Wr
R-type           Ifetch   Reg/Dec  Exec     Mem      Wr
Load                      Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                             Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                                      Ifetch   Reg/Dec  Exec     Mem      Wr
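The write-port conflict, and the way a uniform five-stage R-type removes it, can be checked with a few lines. This sketch assumes one instruction starts per cycle (no bubbles); the names are illustrative.

```python
LOAD = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]
R_TYPE_4 = ["Ifetch", "Reg/Dec", "Exec", "Wr"]         # original four-stage R-type
R_TYPE_5 = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]  # Mem is a do-nothing stage


def write_cycles(program):
    """1-based cycle in which each instruction uses the register file write port."""
    return [start + stages.index("Wr") for start, stages in enumerate(program, start=1)]


def write_conflicts(program):
    """Cycles in which more than one instruction needs the single write port."""
    cycles = write_cycles(program)
    return sorted({c for c in cycles if cycles.count(c) > 1})


print(write_conflicts([R_TYPE_4, R_TYPE_4, LOAD, R_TYPE_4, R_TYPE_4]))
print(write_conflicts([R_TYPE_5, R_TYPE_5, LOAD, R_TYPE_5, R_TYPE_5]))
```

With four-stage R-types, the Load and the R-type behind it both need the write port in cycle 7; with five-stage R-types, every instruction writes in its own cycle.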

The Four Stages of Store


        Cycle 1  Cycle 2  Cycle 3  Cycle 4
Store   Ifetch   Reg/Dec  Exec     Mem      Wr (not used)

Ifetch: Instruction Fetch
Fetch the instruction from the Instruction Memory

Reg/Dec: Registers Fetch and Instruction Decode

Exec: Calculate the memory address

Mem: Write the data into the Data Memory

The Three Stages of Beq


        Cycle 1  Cycle 2  Cycle 3  Cycle 4
Beq     Ifetch   Reg/Dec  Exec     Mem (not used)  Wr (not used)

Ifetch: Instruction Fetch
Fetch the instruction from the Instruction Memory

Reg/Dec: Registers Fetch and Instruction Decode

Exec:
compares the two register operands
selects the correct branch target address
latches it into the PC

Pipelining Summary

All modern-day processors use pipelining
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
Potential speedup: a really fast clock cycle and the ability to complete one instruction every clock cycle (CPI = 1)
Pipeline rate is limited by the slowest pipeline stage

Unbalanced pipe stages make for inefficiencies
The time to fill the pipeline and the time to drain it can impact speedup for deep pipelines and short code runs

Must detect and resolve hazards

Stalling negatively affects CPI (makes CPI greater than the ideal of 1)
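The effect of fill and drain time on speedup can be worked out directly. This is a sketch under simplifying assumptions: no stalls, balanced stages, and a non-pipelined baseline in which every instruction takes one cycle per stage at the same clock rate.

```python
def pipelined_cycles(n_instructions, stages=5):
    """Fill the pipeline once, then complete one instruction per cycle."""
    return stages + n_instructions - 1


def unpipelined_cycles(n_instructions, stages=5):
    """Assumed baseline: each instruction takes `stages` cycles, with no overlap."""
    return stages * n_instructions


def speedup(n_instructions, stages=5):
    return unpipelined_cycles(n_instructions, stages) / pipelined_cycles(n_instructions, stages)


for n in (1, 5, 100, 1_000_000):
    print(n, round(speedup(n), 3))
```

For short code runs the speedup is well below the number of stages; it approaches 5 only as the instruction count grows.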


4.9 Exception: Communication of I/O Devices and Processor

How the processor directs the I/O devices

Special I/O instructions
- Must specify both the device and the command
Memory-mapped I/O
- Portions of the high-order memory address space are assigned to each I/O device. Reads (lw) and writes (sw) to those memory addresses are interpreted as commands to the I/O devices
- Loads/stores to the I/O address space can be done only by the OS

How the I/O device communicates with the processor

Polling: the processor periodically checks the status of an I/O device to determine its need for service
- Processor is totally in control but does all the work
- Can waste a lot of processor time due to speed differences
Interrupt-driven: the I/O device issues an interrupt to the processor to indicate that it needs attention

MIPS I/O Instructions

MIPS has 2 coprocessors: Coprocessor 0 handles exceptions including input and output interrupts, Coprocessor 1 handles floating point

Coprocessors have their own register sets, so there are instructions to move values between these registers and the CPU's registers

Register   #    Use
BadVAddr   8    bad mem addr
Count      9    timer
Compare    11   timer compare
Status     12   intr mask & enable bits
Cause      13   excp type and pending intrs
EPC        14   addr of instr causing excp

mfc0 rt, rd    #move from coprocessor 0
0x10 | 0 | rt | rd | 0 | 0

mtc0 rt, rd    #move to coprocessor 0
0x10 | 4 | rt | rd | 0 | 0
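The two instruction formats can be turned into 32-bit words with a short encoder. This sketch assumes the standard MIPS field widths of 6, 5, 5, 5, 5 and 6 bits for the six fields shown; the function names are illustrative.

```python
def encode_cop0_move(rs_field, rt, rd):
    """Pack opcode 0x10, the rs field (0 = from, 4 = to), rt and rd into one word."""
    return (0x10 << 26) | (rs_field << 21) | (rt << 16) | (rd << 11)


def mfc0(rt, rd):
    """Move from coprocessor 0 register rd into CPU register rt."""
    return encode_cop0_move(0, rt, rd)


def mtc0(rt, rd):
    """Move CPU register rt into coprocessor 0 register rd."""
    return encode_cop0_move(4, rt, rd)


# Read Cause (register 13) into CPU register 26 ($k0).
print(hex(mfc0(26, 13)))
# Write CPU register 0 into Status (register 12).
print(hex(mtc0(0, 12)))
```

The last two fields are always zero, so they contribute nothing to the encoded word.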

The Downsides of Polling

Input and output devices are very slow compared to the processor

These time lags are simulated in SPIM, which measures time in instructions executed, not in real clock time
After the transmitter starts to write a character, the transmitter's ready bit becomes 0. It doesn't become ready again until the processor has executed a (large) fixed number of instructions. (You don't want to single step the simulator!)

Polling will execute the "loop until ready" code thousands of times. While the input or output is occurring, nothing else can be done: a waste of resources. There is a better way.
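The cost of polling can be illustrated with a toy transmitter. The Transmitter class below is hypothetical: its ready bit drops to 0 on a write and returns to 1 only after a fixed number of status checks, standing in for the fixed number of instructions described above.

```python
class Transmitter:
    """Hypothetical device: not ready for `delay` status checks after each write."""

    def __init__(self, delay):
        self.delay = delay
        self.busy_for = 0

    def ready(self):
        if self.busy_for > 0:
            self.busy_for -= 1
            return False
        return True

    def write(self, char):
        self.busy_for = self.delay  # ready bit becomes 0 until the delay elapses


def send_polling(device, text):
    """Send text by polling; return how many loop iterations did no useful work."""
    wasted = 0
    for char in text:
        while not device.ready():
            wasted += 1  # loop until ready: nothing else gets done here
        device.write(char)
    return wasted


print(send_polling(Transmitter(delay=1000), "hello"))
```

Every iteration counted in `wasted` is processor time that an interrupt-driven scheme could have spent on other work.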


I/O Interrupts

An I/O interrupt is used to signal an I/O request for service

Can have different urgencies (so may need to be prioritized)
Need to identify the device generating the interrupt

An I/O interrupt is asynchronous wrt instr execution

- An I/O interrupt is not associated with any instruction and does not prevent any instruction from completion
- You can pick your own convenient point to take an interrupt

Advantage

User program progress is only halted during the actual transfer of I/O data to/from user memory space

Disadvantage: special hardware is needed to

Cause an interrupt (I/O device)
Detect an interrupt and save the proper information to resume after servicing the interrupt (processor)

Additions to MIPS ISA for I/O

Coprocessor 0 records the information the software needs to handle exceptions (including interrupts)

EPC (register 14) holds the address+4 of the instruction that was executing when the exception occurred

Status (register 12): exception mask and enable bits

[Register layout: Intr Mask (bits 15-8), User mode, Excp level (bit 1), Intr enable (bit 0)]

- Intr Mask = 1 bit for each of 6 hw and 2 sw exception levels (1 enables exceptions at that level, 0 disables them)
- User mode = 0 if running in kernel mode when the exception occurred; 1 if running in user mode (fixed at 1 in SPIM)
- Excp level = set to 1 (disables exceptions) when an exception occurs; should be reset by the exception handler when done
- Intr enable = 1 if exceptions are enabled; 0 if disabled
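The Status fields can be extracted with shifts and masks. This is a sketch: the positions of Intr Mask (bits 15-8), Excp level (bit 1) and Intr enable (bit 0) follow the layout above, while placing User mode at bit 4 is an assumption based on the SPIM layout. The function names are illustrative.

```python
def decode_status(status):
    """Split a Status register value into its fields."""
    return {
        "intr_mask": (status >> 8) & 0xFF,
        "user_mode": (status >> 4) & 1,   # assumption: bit 4, as in SPIM
        "excp_level": (status >> 1) & 1,
        "intr_enable": status & 1,
    }


def interrupt_allowed(status, level):
    """True if an interrupt at mask bit `level` (0-7) would be taken."""
    fields = decode_status(status)
    mask_bit = (fields["intr_mask"] >> level) & 1
    return bool(fields["intr_enable"] and not fields["excp_level"] and mask_bit)


print(decode_status(0xFF11))
print(interrupt_allowed(0xFF11, 3))  # enabled, not in a handler, level unmasked
print(interrupt_allowed(0xFF13, 3))  # Excp level = 1 blocks it
```

Setting Excp level on entry to the handler is what keeps a second exception from interrupting the handler before it has saved its state.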

Additions to MIPS ISA, Cont

Cause (register 13): exception pending and type bits

[Register layout: Branch delay (bit 31), Pending exception (PI, bits 15-8), Exception code (bits 6-2); PI3 = recv intr, PI2 = trans intr]

- PI: bits set if an exception occurs but is not yet serviced

so can handle more than one exception occurring at the same time, or record exception requests when exceptions are disabled

- Exception code: encodes reasons for exception

0 (INT) external interrupt (I/O device request)
4 (AdEL) address error trap (load or instr fetch)
5 (AdES) address error trap (store)
6 (IBE) bus error on instruction fetch trap
7 (DBE) bus error on data load or store trap
8 (Sys) syscall trap
9 (Bp) breakpoint trap
10 (RI) reserved (or undefined) instruction trap
12 (Ov) arithmetic overflow trap
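An exception handler typically starts by decoding Cause. The sketch below assumes the Exception code occupies bits 6-2 and the pending bits occupy bits 15-8, with Branch delay in bit 31; the function and dictionary names are illustrative.

```python
EXCEPTION_CODES = {
    0: "INT", 4: "AdEL", 5: "AdES", 6: "IBE", 7: "DBE",
    8: "Sys", 9: "Bp", 10: "RI", 12: "Ov",
}


def decode_cause(cause):
    """Split a Cause register value into its fields."""
    code = (cause >> 2) & 0x1F  # assumption: Exception code in bits 6-2
    return {
        "branch_delay": (cause >> 31) & 1,
        "pending": (cause >> 8) & 0xFF,
        "code": code,
        "name": EXCEPTION_CODES.get(code, "unknown"),
    }


# Arithmetic overflow trap.
print(decode_cause(12 << 2))
# External interrupt with pending bit 3 set, raised in a branch delay slot.
print(decode_cause((1 << 31) | (1 << 11)))
```

The pending field is a bit vector rather than a single number, which is how more than one outstanding request can be recorded at once.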

MIPS Exception Return Instruction

Exception return sets the Excp level bit in coprocessor 0's Status register to 0 (re-enabling exceptions) and returns to the instruction pointed to by coprocessor 0's EPC register

eret    #return from exception
0x10 | 1 | 0 | 0 | 0 | 0x18

Exceptions in General
[Diagram: a user program in normal control flow (sequential, jumps, branches, calls, returns) transfers to the System Exception Handler on an exception, which then returns from the exception]

Exception = unprogrammed control transfer

system takes action to handle the exception

- must record the address of the offending or next-to-execute instruction and save (and restore) user state

returns control to user after handling the exception

Two Types of Exceptions

Interrupts

caused by external events (e.g., request from I/O device)
asynchronous to program execution
may be handled between instructions
simply suspend and resume user program

Traps

caused by internal events
- exceptional conditions (e.g., arithmetic overflow, undefined instr.)
- errors (e.g., hardware malfunction, memory parity error)
- faults (e.g., non-resident page: page fault)
synchronous to program execution
condition must be remedied by the trap handler
instruction may be retried (or simulated) and program continued or program may be aborted

Additions to MIPS ISA for Interrupts

Control signals to write EPC (EPCWrite), Cause and Status (Cause&StatusWrite)
Hardware to record the type of interrupt in Cause
Modify the finite state machine so that

the address of the interrupt handler (8000 0180hex) can be loaded into the PC, so must increase the size of the PC mux
and save the address of the next instr in EPC

Additions to MIPS ISA for Traps


Control signals to write EPC (EPCWrite & IntrOrExcp), Cause and Status (Cause&StatusWrite)
Hardware to record the type of trap in Cause
Further modify the finite state machine so that

for traps, record the address of the current (offending) instruction in the EPC, so must undo the PC = PC + 4 done during fetch
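The difference between what gets saved in EPC for an interrupt and for a trap can be sketched as one function. This is a minimal sketch, assuming the PC has already been incremented by 4 during fetch; the function name and the example address are illustrative.

```python
HANDLER_ADDRESS = 0x80000180  # 8000 0180hex


def take_exception(pc_after_fetch, is_trap):
    """Return (EPC, new PC) when an exception is taken.

    Interrupt: EPC gets the address of the next instruction (the PC as is).
    Trap: EPC gets the offending instruction, so undo the PC = PC + 4.
    """
    epc = pc_after_fetch - 4 if is_trap else pc_after_fetch
    return epc, HANDLER_ADDRESS


epc, new_pc = take_exception(0x00400028, is_trap=True)
print(hex(epc), hex(new_pc))
```

In hardware this choice corresponds to a mux on the EPC input, with the handler address as an extra input on the PC mux.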

How Control Detects Two Traps

Undefined instruction (RI): detected when no next state is defined in state 1 (decode) for the opcode value

Define the next state value for all undefined op values as new state 10

Arithmetic overflow (Ov): the overflow signal from the ALU is used in state 6 (if we don't want to complete RegWrite)

Need to modify the FSM in a similar fashion for the remaining traps

Challenge is to handle the interactions between instructions and exception-causing events so that the control logic remains small and fast

- Complex interactions make the control unit the most challenging aspect of hardware design, especially in pipelined processors

What Makes Pipelining Hard?


Interrupts cause great havoc!

Examples of interrupts:
Power failing, Arithmetic overflow, I/O device request, OS call, Page fault

There are 5 instructions executing in a 5-stage pipeline when an interrupt occurs:
How to stop the pipeline?
How to restart the pipeline?
Who caused the interrupt?

Interrupts (also known as: faults, exceptions, traps) often require
surprise jump (to vectored address)
linking return address
saving of PSW (including CCs)
state change (e.g., to kernel mode)

What Makes Pipelining Hard?


Interrupts cause great havoc!

What happens on an interrupt while in a delay slot?
Next instruction is not sequential
Solution #1: save multiple PCs
- Save current and next PC
- Special return sequence, more complex hardware
Solution #2: single PC plus Branch delay bit
- PC points to branch instruction

Stage   Problem that causes the interrupt
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation

What Makes Pipelining Hard?

Interrupts cause great havoc!

Simultaneous exceptions in more than one pipeline stage, e.g.,
- Load with data page fault in MEM stage
- Add with instruction page fault in IF stage
- Add fault will happen BEFORE load fault
Solution #1
- Interrupt status vector per instruction
- Defer check until last stage, kill state update if exception
Solution #2
- Interrupt ASAP
- Restart everything that is incomplete

Another advantage for state update late in pipeline!
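The reason for deferring the check (Solution #1) can be seen by comparing when a fault is detected with when its instruction reaches the last stage. This sketch assumes a five-stage pipeline with one instruction entering per cycle and no stalls; the stage names and function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def detection_cycle(instr_index, stage):
    """Cycle (1-based) in which instruction `instr_index` raises a fault in `stage`."""
    return instr_index + 1 + STAGES.index(stage)


def commit_cycle(instr_index):
    """Cycle in which the instruction reaches the last stage and its status is checked."""
    return instr_index + len(STAGES)


def first_detected(faults):
    """Instruction whose fault occurs earliest in time."""
    return min(faults, key=lambda i: detection_cycle(i, faults[i]))


def first_handled(faults):
    """Instruction whose fault is handled first when checks are deferred to the last stage."""
    return min(faults, key=commit_cycle)


# Instruction 0 is a load faulting in MEM; instruction 1 is an add faulting in IF.
faults = {0: "MEM", 1: "IF"}
print(first_detected(faults), first_handled(faults))
```

The add's fault occurs first in time, but checking at the last stage means the load's fault is handled first, which matches program order.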

What Makes Pipelining Hard?

Interrupts cause great havoc!

Here's what happens on a data page fault.

                      1  2  3  4  5  6  7  8  9  10 11
i                     F  D  X  M  W
i+1                      F  D  X  M  W                  <- page fault
i+2                         F  D  X  M  W               <- squash
i+3                            F  D  X  M  W            <- squash
i+4                               F  D  X  M  W         <- squash
i+5 trap ->                          F  D  X  M  W
i+6 trap handler ->                     F  D  X  M  W

What Makes Pipelining Hard?


Complex Addressing Modes and Instructions

Complex Instructions

Address modes: Auto increment causes register change during instruction execution

Interrupts? Need to restore register state
Adds WAR and WAW hazards since writes are no longer the last stage.

Memory-Memory Move Instructions


Must be able to handle multiple page faults Long-lived instructions: partial state save on interrupt

Condition Codes


Datapath with Controls to Handle Exceptions


Exception Handling Example


overflow exception


Exception Handling Example


start of exception handling routine
