Beruflich Dokumente
Kultur Dokumente
DEE 1053 Computer Organization Lecture 5: The Processor: Datapath and Control
Dr. Tian-Sheuan Chang
tschang@twins.ee.nctu.edu.tw Dept. Electronics Engineering National Chiao-Tung University
Outline
Logic Design Conventions Basic MIPS implementation Building a Datapath Simple Implementation Multicycle Implementation Exceptions Microprogramming for control design
Institute of Electronics, National Chiao Tung University
Sequential logic
Output depends on current inputs and current states State element to store the states
D Flip Flops
state changes only on a clock edge (edge-triggered methodology)
Falling edge
cycle time
Clock period
DEE 1050 Lecture 5: Processor
Rising edge
Clocking Methodology
Clocking methodology
Defines when signals can be read and when they can be written Mainstream: An edge triggered methodology
Determine when data is valid and stable relative to the clock
Typical execution:
read contents of some state elements, send values through some combinational logic write results to one or more state elements
State element 1 Combinational logic State element 2
Clock cycle
Register File
Built using D flip-flops
Read register number 1 Read register number 2 Write register Write data Register file
Read register number 1 Register 0 Register 1 ... Register n 2 M u x Read data 1
Read data 1
Register n 1
Read data 2
Write
M u x Read data 2
Abstraction
Make sure you understand the abstractions! Sometimes it is easy to think you do, when you dont
Select A31
Select
B31
M u x
C31
32 M u x 32 C
A30
32
B30
M u x . . .
C30 . . .
A0
B0
M u x
C0
Register File
Note: we still use the real clock to determine when to write
Write C Register 0 . . . D C Register 1 D . . . 0 1 Register number n-to-2 n decoder n1 n
Outline
Logic Design Conventions Basic MIPS implementation Building a Datapath Simple Implementation Multicycle Implementation Exceptions Microprogramming for control design
Institute of Electronics, National Chiao Tung University
Operand Fetch
Fetch the operands
Execution (EX)
Perform the operation on the operands
Next Instruction
Determine where to get next instruction
11
Overview of Implementations
Generic Implementation (high similarity between instructions)
use the program counter (PC) to supply instruction address get the instruction from memory read registers use the instruction to decide exactly what to do
Institute of Electronics, National Chiao Tung University
12
13
Add
14
Add Add
15
Outline
Logic Design Conventions Basic MIPS implementation Building a Datapath Simple Implementation Multicycle Implementation Exceptions Microprogramming for control design
Institute of Electronics, National Chiao Tung University
18
Datapath Components
Instruction address Instruction Instruction memory a. Instruction memory b. Program counter c. Adder PC Add Sum
Two state elements are needed to store and access instructions, and an adder is needed to compute the next instruction address.
20
Add
21
steps:
Read two registers Register file Perform an ALU operations ALU Write the result into a register Register file
Datapath component
Register file
A collection of registers in which any register can be read or written by specifying the number of register (register address) in the file Needs a write control signal RegWrite How many ports are required?
ALU
22
Data
RegWrite
a. Registers
b. ALU
The two elements needed to implement R-format ALU operation are the register file and the ALU
DEE 1050 Lecture 5: Processor 23
Read register 1 Instruction Read register 2 Registers Write register Write Data
24
Datapath components
ALU (for address calculation) Register set Sign extension unit
Can you figure out why this?
data memory
DEE 1050 Lecture 5: Processor 25
Address
Write data
The two units needed to implement loads and stores are the data memory unit and the sign-extension unit, in addition to the register file and ALU
DEE 1050 Lecture 5: Processor 26
Instruction
Data memory
16
Sign extend
32
27
Offset
Shift left by 2-bits word alignment
Steps
Read one or two registers Register file Branch/Jump address calculation Adder + sign extension unit+2-bit shift Compare the register contents to check the condition is true or not ALU (zero output) Write the result into PC If branch, New PC = branch address If jump, replacing the lower 28-bit of the PC with the 26-bit immediate values from the instructions shifted by 2-bits
29
The datapath for a branch uses an ALU for evaluation of the branch condition and a separate adder for computing the branch target as the sum of the incremented PC and the sign-extended, lower 16 bits of the I instruction (the branch displacement) shifted left 2 bits.
DEE 1050 Lecture 5: Processor 30
Outline
Logic Design Conventions Basic MIPS implementation Building a Datapath Simple Implementation
ALU control MUX control
Institute of Electronics, National Chiao Tung University
32
Selecting the operations to perform (ALU, read/write, etc.) Controlling the flow of data (multiplexor inputs)
MUX control
Which source registers and destination registers ALU input source Input source of destination register Input source of PC
Branch eq
Subtract for comparison add/subtract for address calculation beq $t1, $t2, offset
R-type
and/or set-on-less-than
36
Reduce the size of main control but may increase the delay
Instroction Instruction opcode ALUOp operation Function Code LW 00 Load word xxxxxx SW 00 store word xxxxxx Branch equal 01 branch equal xxxxxx R-type 10 add 100000 R-type 10 subtract 100010 R-type 10 AND 100100 R-type 10 OR 100101 R-type 10 set-on-less-than 101010
DEE 1050 Lecture 5: Processor
Desired ALU control ALU action input add 0010 add 0010 subtract 0110 add 0010 subtract 0110 and 0000 or 0001 set-on-less-than 0111
I-Format: load/store 35 or 43 rs Load/store 31:26 25:21 I-Format: branch 4 rs 31:26 25:21 DEE 1050 Lecture 5: Processor
rt 20:16
address 15:0
rt 20:16
address 15:0
39
rt 20:16
address 15:0
40
rt 20:16
address 15:0
41
The function of each of the seven control signals. When the 1-bit control to a twoway multiplexor is asserted, the multiplexor selects the input corresponding to 1. Otherwise, if the control is deserted, the multiplexor selects the 0 input. Remember that the state elements all have the clock as an implicit input and that the clock is used in controlling writes.
43
0 Add 4 RegDst Branch MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite Read register 1 Read register 2 Write register Write data Shift left 2 Add ALU result M u x 1
Instruction [3126]
Control
PC
Read data 1 Zero Read data 2 0 M u x 1 ALU ALU result Address Read data 1 M u x 0
Registers
Instruction [150]
16
Sign extend
32
Instruction [50]
Memto- Reg Mem Mem Write Read Write Branch ALUOp1 ALUp0 Instruction RegDst ALUSrc Reg R-format 1 0 0 1 0 0 0 1 0 lw 0 1 1 1 1 0 0 0 0 sw X 1 X 0 0 1 0 0 0 beq X 0 X 0 0 0 1 0 1 DEE 1050 Lecture 5: Processor
0p5
RegDst ALUSrc
Control
0p0
ALU0p1 ALU0p0
45
Outputs
The control function for the simple one-clock implementation is completely specified by this truth table. The top half of the table gives the combinations of input signals that correspond to the four opcodes that determine the control output setting. (Remember that Op (5-0) corresponds to bits 31-26 of the instruction, which is the opcode field.) The bottom portion of the table gives the outputs.
46
47
49
address 25:0
50
Clock cycle
memory (200ps), ALU and adders (100ps), register file access (50ps)
PCSrc M u x Add Shift left 2 Read register 1 ALUSrc Read data 1 ALU operation MemWrite Zero ALU ALU result Address Read data MemtoReg ALU result
Add 4
PC
Read register 2 Registers Read Write data 2 register Write data RegWrite 16 32
M u x
M u x
Data memory
Re gis te r a cce s s
Ins truction Ins truction Re gis te r ALU Da ta Re gis te r Tota l cla s s m e m ory re a d ope ra tion m e m ory write R-type 200 50 100 0 50 400 Loa d word 200 50 100 200 50 600 S tore word 200 50 100 200 550 Bra nch 200 50 100 0 350 J um p 200 200
DEE 1050 Lecture 5: Processor 54
CPU cycle for variable length = 600 25% + 550 10% + 400 45% + 350 15% + 200 5% = 447.5ps CPU performance variable CPU Timefixed 600 = = = 1.34 CPU performancefixed CPU Time variable 447.5
DEE 1050 Lecture 5: Processor 55
Area problems
wasteful of area: duplicate resources
56
Another solution
Pipelining (Ch. 6)
Overlapping the execution of multiple instructions
PC
Instruction register
A ALU B ALUOut
Register #
Outline
Basic MIPS implementation Logic Design Conventions Building a Datapath Simple Implementation Multicycle Implementation Exceptions Microprogramming for control design
Institute of Electronics, National Chiao Tung University
58
Multicycle Implementation
Adv.
Allow instructions to take different number of cycles Share function units within the execution of a single instructions
Institute of Electronics, National Chiao Tung University
Differences compared to 1-cycle design Only single memory unit, single ALU
Add mux for sharing
59
Add 4 Shift left 2 Read register 1 ALUSrc Read data 1 ALU operation Add ALU result
M u x
PC
Read register 2 Registers Read Write data 2 register Write data RegWrite 16 32
M u x
M u x
Data memory
60
PC
Instruction register
A ALU B ALUOut
Register #
61
Multicycle Datapath
62
MultiCycle Flow
63
Control Signals
Control for programmer-visible state units
Write control for PC, memory, register, IR Read control for memory
Institute of Electronics, National Chiao Tung University
ALU control
Same as the one used in single cycle datapath
MUX control
64
Register ALUout
Branch target after computed
Jump address
{NextPC[31:28], Instruction[25:0], 2b00}
Instruction [2521] Instruction [2016] Instruction [150] Instruction register Instruction [150] Memory data register
0 M Instruction u x [1511] 1 0 M u x 1 16
Read data 1 Read register 2 Registers Write Read register data 2 Write data
B 4
0 1M u 2 x 3
Sign extend
32
Shift left 2
In order to accomplish this we must break up the instruction. (kind of like introducing variables when programming)
70
op
71
We define each instruction from the ISA perspective (do this!) Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps) Introduce new registers as needed (e.g, A, B, ALUOut, MDR, etc.) Finally try and pack as much work into each step (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify solution) Result: Our books multicycle Implementation!
72
We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic) Control: ALUSrcA = 0, ALUSrcB = 0, ALUOp = 00 (add)
DEE 1050 Lecture 5: Processor
ALU is performing one of three functions, based on instruction type Memory Reference: ALUOut <= A + sign-extend(IR[15:0]);
Control: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00 (add)
Jump
PC = {PC[31:28], Instruction[25:0], 2b00} Assert PCWrite
MDR <= Memory[ALUOut]; or Memory[ALUOut] <= B; R-type instructions finish Reg[IR[15:11]] <= ALUOut; The write actually takes place at the end of the cycle on the edge
Write-back step
Reg[IR[20:16]] <= MDR; Which instruction needs this?
Institute of Electronics, National Chiao Tung University
Summary:
Simple Questions
How many cycles will it take to execute this code? lw $t2, 0($t3) lw $t3, 4($t3) beq $t2, $t3, Label add $t5, $t2, $t3 sw $t5, 8($t3) ...
#assume not
Label:
What is going on during the 8th cycle of execution? In what cycle does the actual addition of $t2 and $t3 takes place?
PCWriteCond
PCSource Outputs Control Op [50] ALUOp ALUSrcB ALUSrcA RegWrite RegDst 26 Shift left 2 28 Jump address [310] 0 M 1 u x 2
Instruction [25-0] Instruction [3126] Address Memory MemData Write data Instruction [2521] Instruction [2016] Instruction [150] Instruction register Instruction [150] Memory data register Read register 1 Read data 1 0 M u x 1
PC
0 M u x 1
PC [3128]
0 M Instruction u x [1511] 1 0 M u x 1 16
B 4
0 1M u 2 x 3
Sign extend
32
Shift left 2
ALU control
Instruction [50]
81
CPI
0.25*5 + 0.1*4 + 0.52*4+0.11*3+0.02*3 = 4.12 Worst case: 5 if all instructions use the same number of cycles
82
84
85
Memory-Reference FSM
Fig. 5.33
Institute of Electronics, National Chiao Tung University
86
Other FSM
Fig. 5.34 Branch FSM
Institute of Electronics, National Chiao Tung University
Complete FSM
Note:
dont care if not mentioned asserted if name only otherwise exact value
Institute of Electronics, National Chiao Tung University
88
Implementation
Implementation:
PCWrite PCWriteCond IorD MemRead MemWrite Control logic IRWrite MemtoReg PCSource ALUOp Outputs ALUSrcB ALUSrcA RegWrite RegDst NS3 NS2 NS1 NS0
ROM or PLA
Inputs
Op5
Op4
Op3
Op2
Op1
Op0
S3
S2
S1
State register
S0
PLA Implementation
If I picked a horizontal or vertical line could you explain it?
Op5 Op4 Op3 Op2 Op1 Op0 S3 S2 S1 S0 PCWrite PCWriteCond IorD MemRead MemWrite IRWrite MemtoReg PCSource1 PCSource0 ALUOp1 ALUOp0 ALUSrcB1 ALUSrcB0 ALUSrcA RegWrite RegDst NS3 NS2 NS1 NS0
ROM Implementation
ROM = "Read Only Memory"
Institute of Electronics, National Chiao Tung University
ROM Implementation
How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 210 = 1024 different addresses) How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs ROM is 210 x 20 = 20K bits (and a rather unusual size)
Institute of Electronics, National Chiao Tung University
Rather wasteful, since for lots of the entries, the outputs are the same i.e., opcode is often ignored
ROM vs PLA
Break up the table into two parts 4 state bits tell you the 16 outputs, 24 x 16 bits of ROM 10 bits tell you the 4 next state bits, 210 x 4 bits of ROM Total: 4.3K bits of ROM PLA is much smaller can share product terms only need entries that produce an active output can take into account don't cares Size is (#inputs #product-terms) + (#outputs #product-terms) For this example = (10x17)+(20x17) = 510 PLA cells PLA cells usually about the size of a ROM cell (slightly bigger)
PLA or ROM
Outputs
Input 1 State Adder Address select logic Op[5 0] Instruction register opcode field
Details
Op 000000 000010 000100 100011 101011 Dispatch ROM 1 Opcode name R-format jmp beq lw sw Value 0110 1001 1000 0010 0010 Op 100011 101011 Dispatch ROM 2 Opcode name lw sw Value 0011 0101
PLA or ROM 1 State Adder 3 Mux 2 1 AddrCtl 0 0
Dispatch ROM 2
State number 0 1 2 3 4 5 6 7 8 9
Address-control action Use incremented state Use dispatch ROM 1 Use dispatch ROM 2 Use incremented state Replace state number by 0 Replace state number by 0 Use incremented state Replace state number by 0 Replace state number by 0 Replace state number by 0
Value of AddrCtl 3 1 2 3 0 0 3 0 0 0
Microinstruction format
Partition into fields Each control signals correspond to 1 bit in microinstructions Each field of the microinstructions responsible for specifying a nonoverlapping set of control signals
96
Microprogramming
Control unit Microcode memory PCWrite PCWriteCond IorD MemRead MemWrite IRWrite BWrite MemtoReg PCSource ALUOp ALUSrcB ALUSrcA RegWrite RegDst AddrCtl
Datapath
Outputs
Microprogramming
A specification methodology
appropriate if hundreds of opcodes, modes, cycles, etc. signals specified symbolically using microinstructions
Label Fetch Mem1 LW2 ALU control Add Add Add SRC1 SRC2 PC 4 PC Extshft Read A Extend Register control Memory Read PC PCWrite control ALU Sequencing Seq Dispatch 1 Dispatch 2 Seq Fetch Fetch Seq Fetch Fetch Fetch
Institute of Electronics, National Chiao Tung University
Read ALU Write MDR Write ALU A A B Write ALU B ALUOut-cond Jump address
Will two implementations of the same architecture have the same microcode? What would a microassembler do?
DEE 1050 Lecture 5: Processor
Microinstruction format
Value Add Subt Func code PC A B 4 Extend Extshft Read Write ALU Signals active ALUOp = 00 ALUOp = 01 10 =0 =1 = 00 = 01 = 10 = 11 ALUOp = ALUSrcA ALUSrcA ALUSrcB ALUSrcB ALUSrcB ALUSrcB
SRC1
SRC2
Read PC Memory Read ALU Write ALU ALU PC write control ALUOut-cond jump address Seq Fetch Dispatch 1 Dispatch 2
Sequencing
RegWrite, RegDst = 1, MemtoReg = 0 RegWrite, RegDst = 0, MemtoReg = 1 MemRead, lorD = 0 MemRead, lorD = 1 MemWrite, lorD = 1 PCSource = 00 PCWrite PCSource = 01, PCWriteCond PCSource = 10, PCWrite AddrCtl = 11 AddrCtl = 00 AddrCtl = 01 AddrCtl = 10
Comment Cause the ALU to add. Cause the ALU to subtract; this implements the compare for branches. Use the instruction's function code to determine ALU control. Use the PC as the first ALU input. Register A is the first ALU input. Register B is the second ALU input. Use 4 as the second ALU input. Use output of the sign extension unit as the second ALU input. Use the output of the shift-by-two unit as the second ALU input. Read two registers using the rs and rt fields of the IR as the register numbers and putting the data into registers A and B. Write a register using the rd field of the IR as the register number and the contents of the ALUOut as the data. Write a register using the rt field of the IR as the register number and the contents of the MDR as the data. Read memory using the PC as address; write result into IR (and the MDR). Read memory using the ALUOut as address; write result into MDR. Write memory using the ALUOut as address, contents of B as the data. Write the output of the ALU into the PC. If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut. Write the PC with the jump address from the instruction. Choose the next microinstruction sequentially. Go to the first microinstruction to begin a new instruction. Dispatch using the ROM 1. Dispatch using the ROM 2.
Lots of encoding:
send the microinstructions through logic to get control signals uses less memory, slower
Microcode: Trade-offs
Distinction between specification and implementation is sometimes blurred Specification Advantages:
Easy to design and write Design architecture and microcode in parallel
Institute of Electronics, National Chiao Tung University
Historical Perspective
In the 60s and 70s microprogramming was very important for implementing machines This led to more sophisticated ISAs and the VAX In the 80s RISC processors based on pipelining became popular Pipelining the microinstructions is also possible! Implementations of IA-32 architecture processors since 486 use:
hardwired control for simpler instructions (few cycles, FSM control implemented using PLA or random logic) microcoded control for more complex instructions (large numbers of cycles, central control store)
Institute of Electronics, National Chiao Tung University
The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store
102
Pentium 4
Chapter 7
Secondary cache and memory interface
Integer datapath
Control
Chapter 6
Control
Pipelining is used for the simple instructions favored by compilers Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions
103
Pentium 4
Integer datapath
Control
Control
Processor executes simple microinstructions, 70 bits wide (hardwired) 120 control lines for integer datapath (400 for floating point) If an instruction requires more than 4 microinstructions to implement, control from microcode ROM (8000 microinstructions) Its complicated!
104
Exception
Control is the most challenging part of processor design
Institute of Electronics, National Chiao Tung University
Exception
Unexpected event from within the processor E.g. arithmetic flow, undefined instructions, invoke the OS from user program
Interrupt
An except that comes from outside of the processor E.g. I/O devices, hardware malfunctions
These two terms are mixed in different processors Why are they so hard
Detect exceptional conditions and take the action is often on the critical timing path of a machine
DEE 1050 Lecture 5: Processor 105
2.2 OS terminate the program or continue its execution using the EPC
DEE 1050 Lecture 5: Processor 106
By vector interrupts
Each exception vector address corresponds to its reasons
Undefined instruction Arithmetic overflow C000 0000 C000 0020
107
Cause registers
Register to record the cause of the exception, use the low order two bits for undefined instruction and arithmetic overflow
Control signals:
Register write control: EPCWrite, CauseWrite, Interrupt/exception cause selection: IntCause
Exception address
Constant value: 8000 0180
109
Arithmetic overflow
Overflow signals from ALU is used
110
111
Chapter 5 Summary
If we understand the instructions We can build a simple processor! If instructions take different amounts of time, multi-cycle is better Datapath implemented using:
Combinational logic for arithmetic State holding elements to remember bits
Institute of Electronics, National Chiao Tung University
112