Building a Processor 2
Heiner Litz
https://canvas.ucsc.edu/courses/19290
Review
Initial Processor Datapath
• Instruction = Memory[PC]
• Fetch the instruction from memory
• Instructions are always 32 bits wide
Fetching the Instruction
[Figure: the PC, a 32-bit register, supplies the address to instruction memory; an adder increments the PC by 4 to address the next 32-bit instruction.]
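The fetch step can be sketched in a few lines of Python. This is only an illustrative model, not the hardware: the function name `fetch`, the dictionary-based instruction memory, and the memory contents are assumptions made for the example.

```python
# Hypothetical software model of instruction fetch (names are illustrative).
INSTRUCTION_BYTES = 4  # every instruction is 32 bits wide


def fetch(memory: dict, pc: int) -> tuple:
    """Return (instruction, next_pc) for a word-aligned PC."""
    if pc % INSTRUCTION_BYTES != 0:
        raise ValueError("PC must be word-aligned")
    instruction = memory[pc] & 0xFFFFFFFF            # Instruction = Memory[PC]
    next_pc = (pc + INSTRUCTION_BYTES) & 0xFFFFFFFF  # PC + 4, wraps at 32 bits
    return instruction, next_pc


# Toy instruction memory keyed by byte address (contents are arbitrary).
imem = {0: 0x00000013, 4: 0x00000033}
print(fetch(imem, 0))
```

The adder and the memory read happen in parallel in hardware; the model simply returns both results together.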
Arithmetic Instructions
• Read two register operands
• Perform arithmetic/logical operation
• Write register result
ORI Instruction
• OR immediate instruction
I-type format: immediate[11:0] | rs1 | funct3 | rd | opcode
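As a rough illustration of the I-type layout, the sketch below slices the five fields out of a 32-bit instruction word. The bit positions for the immediate, rs1, rd, and opcode match the labels used in the datapath figures (Instruction[31-20], [19-15], [11-7], [6-0]); placing funct3 in bits [14-12] and the ORI encoding values are taken from the standard RISC-V encoding, which is assumed here. The function name `decode_i_type` is hypothetical.

```python
def decode_i_type(instruction: int) -> dict:
    """Split a 32-bit I-type instruction word into its fields."""
    return {
        "imm": (instruction >> 20) & 0xFFF,    # Instruction[31-20]
        "rs1": (instruction >> 15) & 0x1F,     # Instruction[19-15]
        "funct3": (instruction >> 12) & 0x7,   # Instruction[14-12] (assumed)
        "rd": (instruction >> 7) & 0x1F,       # Instruction[11-7]
        "opcode": instruction & 0x7F,          # Instruction[6-0]
    }


# Build "ori x5, x6, 0x0F0" from its fields, assuming the RISC-V encoding
# (funct3 = 0b110, opcode = 0b0010011), then decode it again.
word = (0x0F0 << 20) | (6 << 15) | (0b110 << 12) | (5 << 7) | 0b0010011
fields = decode_i_type(word)
print(fields)
```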
Datapath: ORI Instruction
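• Read data 2 is ignored for immediates
• ALUSrc and ALUOp are set based on the instruction
I-type format: immediate[11:0] | rs1 | funct3 | rd | opcode
[Figure: ORI datapath. Instruction[19-15] selects read register 1, Instruction[24-20] selects read register 2, and Instruction[11-7] selects the write register. Instruction[31-20] passes through a sign-or-zero extender to 32 bits. A mux controlled by ALUSrc chooses between Read data 2 and the extended immediate as the second ALU input. The ALU produces the ALU result and the Zero output; the result is the register file's write data, gated by RegWrite.]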
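The path an ORI takes through this datapath can be sketched as follows. This is a simplified software model under stated assumptions: the immediate is sign-extended (as RISC-V specifies for ORI, although the figure allows sign or zero extension), register x0 is treated as hard-wired to zero, and all function names are made up for the example.

```python
MASK32 = 0xFFFFFFFF


def sign_extend_12(imm12: int) -> int:
    """Sign-extend a 12-bit immediate to a 32-bit value."""
    imm12 &= 0xFFF
    value = imm12 - 0x1000 if imm12 & 0x800 else imm12
    return value & MASK32


def execute_ori(regs: list, rs1: int, rd: int, imm12: int) -> int:
    """Model one ORI: read rs1, OR with the extended immediate, write rd."""
    read_data_1 = regs[rs1]              # Read data 1
    immediate = sign_extend_12(imm12)    # extender output
    alu_input_2 = immediate              # ALUSrc = 1 selects the immediate
    result = (read_data_1 | alu_input_2) & MASK32  # ALUOp = OR
    if rd != 0:                          # assumption: x0 stays zero
        regs[rd] = result                # RegWrite = 1
    return result


regs = [0] * 32
regs[6] = 0x0000000F
print(hex(execute_ori(regs, rs1=6, rd=5, imm12=0x0F0)))
```

Read data 2 never appears in the function, which mirrors the first bullet above: the second register operand is read but ignored.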
Branch Instruction
• Branch instruction: beq rs1, rs2, immediate
S/B-type format: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode
Datapath for the PC
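S/B-type format: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode
[Figure: PC datapath. The PC supplies the read address to instruction memory. One adder computes the sequential next PC; a second adder computes the branch target from the sign-extended immediate. A mux selects between the two, using the Branch control signal and the ALU's Zero output.]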
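The next-PC selection in this figure can be modeled as below. The sketch assumes a RISC-V-style branch, where the target is the PC of the branch plus a signed byte offset, and assumes the mux select is Branch AND Zero; the function name `next_pc` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def next_pc(pc: int, branch: bool, zero: bool, offset: int) -> int:
    """Choose between PC + 4 and the branch target."""
    sequential = (pc + 4) & MASK32        # first adder
    target = (pc + offset) & MASK32       # second adder (assumed PC-relative)
    take_branch = branch and zero         # mux select: Branch AND Zero
    return target if take_branch else sequential


# beq is taken only when the ALU reports rs1 - rs2 == 0 (Zero = 1).
print(next_pc(100, branch=True, zero=True, offset=16))
print(next_pc(100, branch=True, zero=False, offset=16))
```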
Control
• State-free
• Every instruction takes a single cycle
• Just decode instruction bits
[Figure: single-cycle datapath with instruction memory, register file, ALU, and data memory. Control signals shown: RegWrite, ALUOp, ALUSrc, MemWrite, MemRead, MemtoReg, and the sign-extend control. Each signal value is marked <prev> on the slide.]
Control: addu
[Figure: single-cycle datapath annotated with the control values for addu: RegWrite = 1, ALUOp = <op>, ALUSrc = 0, MemWrite = 0, MemRead = 0, MemtoReg = 0, Branch = 0, Jump = 0. The sign-extend control is a don't-care (X).]
Control: Load
[Figure: single-cycle datapath annotated with the control values for a load: RegWrite = 1, ALUOp = Add, ALUSrc = 1, MemWrite = 0, MemRead = 1, MemtoReg = 1, sign-extend = 1, Branch = 0, Jump = 0.]
Putting It All Together:
Our First Processor
[Figure: complete single-cycle datapath. The Control unit decodes Instruction[6-0] and drives Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite. Muxes controlled by Branch and Jump select the next PC.]
How to generate control signals?
• Consider the hypothetical example:
[Figure: Control block with input Instruction[6-0] and outputs Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite.]
• MemWrite equals 1 if:
  Instruction[0] & Instruction[2] & !Instruction[5]
• Build using combinational logic
[Figure: AND gate with inputs Instruction[0], Instruction[2], and inverted Instruction[5]; the output is MemWrite.]
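The hypothetical MemWrite equation above translates directly into a small Boolean function. This is only a sketch of that one example equation, not a real opcode decoder; the helper names `bit` and `mem_write` are made up for illustration.

```python
def bit(word: int, index: int) -> int:
    """Return bit `index` of `word` as 0 or 1."""
    return (word >> index) & 1


def mem_write(instruction: int) -> int:
    """MemWrite = Instruction[0] & Instruction[2] & !Instruction[5]."""
    return bit(instruction, 0) & bit(instruction, 2) & (1 - bit(instruction, 5))


# Enumerate all 7-bit opcode values that would assert MemWrite.
asserting = [op for op in range(128) if mem_write(op)]
print(len(asserting), [format(op, "07b") for op in asserting[:4]])
```

Because the output depends only on the current instruction bits, the function has no state, which matches the state-free control described earlier.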
Single Cycle Processor
Performance
• Functional unit delay
  • Memory: 200 ps
  • ALU and adders: 200 ps
  • Register file: 100 ps
• Cons
  • Cycle time is the worst-case path → long cycle times
    • Worst case = load
  • Hardware is underutilized
    • ALU and memory used only for a fraction of the clock cycle
    • Not well amortized!
  • Best possible CPI is 1
Variable Clock Single Cycle
Processor Performance
• Instruction mix
  • 45% ALU
  • 25% loads
  • 10% stores
  • 15% branches
  • 5% jumps

Delay per functional unit (ps):

Instruction  Instruction  Register  ALU        Data    Register  Total
class        memory       read      operation  memory  write
R-type       200          100       200        -       100       600
load         200          100       200        200     100       800
store        200          100       200        200     -         700
branch       200          100       200        -       -         500
jump         200          -         -          -       -         200
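The table and the instruction mix are enough to compare a fixed clock with a hypothetical variable clock. The sketch below uses only the numbers on this slide and assumes that "45% ALU" corresponds to the R-type row; the variable names are illustrative.

```python
# Total latency per instruction class in picoseconds (from the table).
latency_ps = {"R-type": 600, "load": 800, "store": 700, "branch": 500, "jump": 200}

# Instruction mix in percent (integers keep the arithmetic exact).
mix_percent = {"R-type": 45, "load": 25, "store": 10, "branch": 15, "jump": 5}

# Fixed clock: every instruction pays for the slowest one (the load).
fixed_cycle_ps = max(latency_ps.values())

# Variable clock: each instruction takes only as long as it needs.
average_cycle_ps = sum(mix_percent[c] * latency_ps[c] for c in latency_ps) / 100

speedup = fixed_cycle_ps / average_cycle_ps
print(fixed_cycle_ps, average_cycle_ps, round(speedup, 2))
```

Under these assumptions the fixed clock is 800 ps, the weighted average is 625 ps, and the variable clock would be about 1.28 times faster.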
Key Tools for System Architects
1. Pipelining
2. Parallelism
3. Out-of-order execution
4. Prediction
5. Caching
6. Indirection
7. Amortization
8. Redundancy
9. Specialization
10. Focus on the common case
Pipelining: The Laundry Analogy
• Ann, Brian, Cathy, and Dave are each doing laundry
Single-cycle Laundry
[Figure: timeline from 6 PM to midnight. Tasks A, B, C, and D run one after another in task order; each load takes 30 + 40 + 20 minutes.]
Single-cycle laundry takes 6 hours for 4 loads
Pipelined Laundry
[Figure: timeline from 6 PM to midnight. Tasks A, B, C, and D overlap in task order; the segments along the time axis are 30, 40, 40, 40, 40, and 20 minutes.]
Pipelined laundry takes 3.5 hours for 4 loads
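Both totals can be reproduced with a small simulation. The sketch assumes three stages of 30, 40, and 20 minutes (the durations shown on the timelines), one load per stage at a time, and that a finished load may wait between stages until the next stage is free. The function names are hypothetical.

```python
def sequential_minutes(stage_minutes: list, loads: int) -> int:
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes: list, loads: int) -> int:
    """Loads overlap; each stage handles one load at a time."""
    stage_free_at = [0] * len(stage_minutes)  # when each stage is next idle
    finish = 0
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(stage_minutes):
            start = max(t, stage_free_at[s])  # wait for the stage if busy
            t = start + duration
            stage_free_at[s] = t
        finish = t
    return finish


stages = [30, 40, 20]
print(sequential_minutes(stages, 4) / 60, pipelined_minutes(stages, 4) / 60)
```

With four loads this gives 360 minutes (6 hours) unpipelined and 210 minutes (3.5 hours) pipelined, matching the two slides; the 40-minute stage sets the pace once the pipeline is full.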
Lessons from Laundry Analogy
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Pipeline rate is limited by the slowest pipeline stage
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduce speedup
[Figure: pipelined laundry timeline from 6 PM; tasks A, B, C, and D overlap in task order, with segments of 30, 40, 40, 40, 40, and 20 minutes.]
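The fill and drain effect can be quantified for the idealized case of perfectly balanced stages, each taking one time unit. Under that assumption, n tasks on a k-stage pipeline need k + n - 1 time units instead of n × k. The function name below is illustrative.

```python
def pipeline_speedup(tasks: int, stages: int) -> float:
    """Speedup of a balanced pipeline over running tasks back to back."""
    unpipelined_time = tasks * stages    # every task uses every stage in turn
    pipelined_time = stages + tasks - 1  # fill the pipe, then one task per unit
    return unpipelined_time / pipelined_time


for n in (1, 4, 100, 10_000):
    print(n, round(pipeline_speedup(n, stages=5), 3))
```

A single task sees no speedup at all, and the speedup only approaches the number of stages as the number of tasks grows, which is why fill and drain time matter for short workloads.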
Another Analogy:
Model T Assembly Line
Pipelining the Processor
• 5 stages, one clock cycle per stage
  • IF: instruction fetch from memory
  • ID: instruction decode & register read
  • EX: execute operation or calculate address
  • MEM: access memory operand
  • WB: write result back to register

       Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
lw     IF       RF/ID    EX       MEM      WB
Pipelining the Processor
• Overlap instructions in different stages
• All hardware used all the time (amortization)
• Clock cycle is fast
• CPI is still 1
[Figure: pipeline timing diagram spanning clock cycles 1 to 7, with instructions overlapping in successive stages.]
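The overlap can be visualized by generating the stage occupancy per cycle. This sketch assumes an ideal pipeline with no stalls or hazards, one new instruction entering IF each cycle; the instruction names other than lw and the function name are made up for the example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions: list) -> list:
    """Return (name, row) pairs; row[c] is the stage occupied in cycle c + 1."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i enters IF in cycle i + 1
        rows.append((name, row))
    return rows


diagram = pipeline_diagram(["lw", "add", "ori"])  # hypothetical sequence
for name, row in diagram:
    print(f"{name:4}" + "".join(f"{stage:5}" for stage in row))
```

Three instructions finish in 7 cycles rather than 15, and from cycle 5 onward one instruction completes per cycle, which is the sense in which CPI is still 1.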
To Be Continued
• Pipelined datapath and control