Midterm on May 8th.
Midterm review on May 6th. Come to class with questions.
The midterm will cover everything before it.
It will mostly resemble the homeworks and the reading quizzes.
We will send out a reading quiz compendium on Thursday.
It will be challenging.
It will be curved.

The Final is Also Coming (but more slowly)
  Despite what the online schedule says, we have only one final time, and it is 6/10/2014, 8:00am-11:00am.

Implementing a MIPS Processor
  Readings: 4.1-4.9

Goals for this Class
  Understand how CPUs run programs
    How do we express the computation to the CPU?
    How does the CPU execute it?
    How does the CPU support other system components (e.g., the OS)?
    What techniques and technologies are involved and how do they work?
  Understand why CPU performance (and other metrics) varies
    How does CPU design impact performance?
    What trade-offs are involved in designing a CPU?
    How can we meaningfully measure and compare computer systems?
  Understand why program performance varies
    How do program characteristics affect performance?
    How can we improve a program's performance by considering the CPU running it?
    How do other system components impact program performance?

Goals
  Understand how the 5-stage MIPS pipeline works
    See examples of how architecture impacts ISA design
    Understand how the pipeline affects performance
  Understand hazards and how to avoid them
    Structural hazards
    Data hazards
    Control hazards

Processor Design in Two Acts
  Act I: A single-cycle CPU

Foreshadowing
  Act I: A Single-cycle Processor
    Simplest design
    Not how many real machines work (maybe some deeply embedded processors)
    Figure out the basic parts; what it takes to execute instructions
  Act II: A Pipelined Processor
    This is how many real machines work
    Exploit parallelism by executing multiple instructions at once.

Target ISA
  We will focus on part of MIPS
    Enough to run into the interesting issues
    Memory operations
    A few arithmetic/logical operations (generalizing is straightforward)
    BEQ and J
  This corresponds pretty directly to what you'll be implementing in 141L.

Basic Steps for Execution
  Fetch an instruction from the instruction store
  Decode it
    What does this instruction do?
  Gather inputs
    From the register file
    From memory
  Perform the operation
  Write back the outputs
    To register file or memory
  Determine the next instruction to execute

The Processor Design Algorithm
  Once you have an ISA
  Design/draw the datapath
    Identify and instantiate the hardware for your architectural state
    Foreach instruction
      Simulate the instruction
      Add and connect the datapath elements it requires
      Is it workable? If not, fix it.
  Design the control
    Foreach instruction
      Simulate the instruction
      What control lines do you need?
      How will you compute their value?
      Modify control accordingly
      Is it workable? If not, fix it.
  You've already done much of this in 141L.

Arithmetic; R-Type
  Inst = Mem[PC]
  REG[rd] = REG[rs] op REG[rt]
  PC = PC + 4

  bits     31:26  25:21  20:16  15:11  10:6   5:0
  name     op     rs     rt     rd     shamt  funct
  # bits   6      5      5      5      5      6

ADDI; I-Type
  PC = PC + 4
  REG[rt] = REG[rs] op SignExtImm

  bits     31:26  25:21  20:16  15:0
  name     op     rs     rt     imm
  # bits   6      5      5      16

Load Word
  PC = PC + 4
  REG[rt] = MEM[SignExtImm + REG[rs]]

  bits     31:26  25:21  20:16  15:0
  name     op     rs     rt     immediate
  # bits   6      5      5      16

Store Word
  PC = PC + 4
  MEM[SignExtImm + REG[rs]] = REG[rt]

  bits     31:26  25:21  20:16  15:0
  name     op     rs     rt     immediate
  # bits   6      5      5      16

Branch-equal; I-Type
  PC = (REG[rs] == REG[rt]) ? PC + 4 + SignExtImm * 4 : PC + 4

  bits     31:26  25:21  20:16  15:0
  name     op     rs     rt     displacement
  # bits   6      5      5      16

A Single-cycle Processor
  Performance refresher
    ET = IC * CPI * CT
  Single-cycle CPI == 1; that sounds great
  Unfortunately, single-cycle CT is large
    Even RISC instructions take quite a bit of effort to execute
    This is a lot to do in one cycle

Our Hardware is Mostly Idle
  Cycle time = 18 ns
  Slowest module (ALU) is ~6 ns

Processor Design in Two Acts
  Act II: A pipelined CPU

Pipelining
  Letter  Answer
  A       Allows the execution of multiple instructions to overlap
  B       Prevents branch articulation
  C       Significantly decreases the amount of time it takes to execute a particular instruction
  D       Significantly increases the amount of time it takes to implement a particular instruction
  E       A and D

Pipelining
  Letter  Answer
  A       Increases instruction count
  B       Reduces CPI
  C       Reduces cycle time
  D       Has no effect on performance
  E       B and C

Data hazards
  Letter  Answer
  A       Occur because a value is not ready when it's needed
  B       Occur because the next PC is not yet known
  C       Cannot be removed
  D       A and B
  E       All of the above

Stalling a processor
  Letter  Answer
  A       Reduces CPI and increases instruction count
  B       Means that instructions early in the pipeline stop making progress
  C       Can resolve some hazards
  D       B and C
  E       A and C

Forwarding
  Letter  Answer
  A       Is just for email
  B       Allows the processor to resolve control hazards
  C       Improves CPI
  D       Reduces cycle time
  E       Interacts poorly with stalling
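
As a rough illustration of the R-type and I-type field layouts and the branch-equal next-PC rule given earlier in this section, here is a minimal Python sketch. The function names (sign_extend16, decode, beq_next_pc) are made up for this example and are not part of the course materials; the funct value 0x20 used to build the test word is the standard MIPS encoding for add.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def decode(inst):
    """Split a 32-bit instruction word into the fields from the format tables."""
    return {
        "op": (inst >> 26) & 0x3F,     # bits 31:26
        "rs": (inst >> 21) & 0x1F,     # bits 25:21
        "rt": (inst >> 16) & 0x1F,     # bits 20:16
        "rd": (inst >> 11) & 0x1F,     # bits 15:11 (meaningful for R-type)
        "shamt": (inst >> 6) & 0x1F,   # bits 10:6  (meaningful for R-type)
        "funct": inst & 0x3F,          # bits 5:0   (meaningful for R-type)
        "imm": sign_extend16(inst),    # bits 15:0, sign-extended (I-type)
    }


def beq_next_pc(pc, rs_val, rt_val, imm):
    """PC = (REG[rs] == REG[rt]) ? PC + 4 + SignExtImm * 4 : PC + 4"""
    return pc + 4 + imm * 4 if rs_val == rt_val else pc + 4


# add $1, $2, $3 assembled by hand from the R-type layout:
# op = 0, rs = 2, rt = 3, rd = 1, shamt = 0, funct = 0x20
add_word = (0 << 26) | (2 << 21) | (3 << 16) | (1 << 11) | (0 << 6) | 0x20
fields = decode(add_word)
print(fields["rs"], fields["rt"], fields["rd"])
```

The decoder extracts every field unconditionally, which mirrors how the datapath figures wire inst[25:21], inst[20:16], and inst[15:11] to the register file before the control has decided which of them matter.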

Pipelining Review
  Pipelining
    Break up the logic with pipeline registers into pipeline stages
    Each stage can act on a different instruction/data
    States/control signals of instructions are held in pipeline registers (latches)
  [Figure: 10 ns of logic divided into five 2 ns stages separated by latches]

Pipelining
  [Figure: the five 2 ns stages and their latches, shown at cycles #1 through #5]

Performance of a pipeline processor
  If we have 500 instructions, what's the speedup of a 5-stage pipelined processor with a 2 ns cycle time vs. a single-cycle processor with a 10 ns cycle time?
  A. 5
  B. 4.96
  C. 2.78
  D. 1
  E. None of the above

Recap: Clock
  A hardware signal defines when data is valid and stable
    Think about the clock in real life!
  We use edge-triggered clocking
    Values stored in the sequential logic are updated only on a clock edge
  [Figure: sequential logic and combinational logic]

The 5-Stage MIPS Pipeline
  Instruction Fetch (IF)
    Read the instruction
  Instruction Decode (ID)
    Figure out the incoming instruction
    Fetch the operands from the register file
  Execution (EXE): ALU
    Perform ALU functions
  Memory Access (MEM)
    Read/write data memory
  Write Back (WB): write back results to registers
    Write to register file

Pipelined Datapath
  [Figure: datapath with PC, instruction memory, PC + 4 adder, register file, sign extend (16 to 32 bits), shift left 2, branch adder, ALU, and data memory]

Pipelined datapath
  [Figure: the datapath divided by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers into Instruction Fetch, Instruction Decode, Execution, Memory Access, and Write Back. Register file inputs come from inst[25:21], inst[20:16], and inst[15:11]. Control signals: ALUSrc, MemtoReg, MemRead, RegDst, RegWrite, MemWrite, PCSrc, with PCSrc = Branch & Zero]
  Will this work?
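
The 500-instruction speedup question earlier in this section can be worked out with a few lines of Python. This is only a sketch under simplifying assumptions that the slides do not state explicitly: no stalls or hazards, CPI of 1 for the single-cycle machine, and a pipeline that needs (stages - 1) extra cycles to fill before it completes one instruction per cycle. The function names are made up for this example.

```python
def single_cycle_time_ns(n_instructions, cycle_time_ns):
    """Execution time with CPI == 1: one long cycle per instruction."""
    return n_instructions * cycle_time_ns


def pipelined_time_ns(n_instructions, n_stages, cycle_time_ns):
    """Execution time for an ideal pipeline with no stalls.

    The first instruction needs n_stages cycles to pass through the pipe;
    every later instruction finishes one cycle after the one before it.
    """
    cycles = n_stages + (n_instructions - 1)
    return cycles * cycle_time_ns


def pipeline_speedup(n_instructions, n_stages, single_ct_ns, pipe_ct_ns):
    """Ratio of single-cycle execution time to pipelined execution time."""
    old = single_cycle_time_ns(n_instructions, single_ct_ns)
    new = pipelined_time_ns(n_instructions, n_stages, pipe_ct_ns)
    return old / new


# Numbers from the question: 500 instructions, 5 stages, 10 ns vs. 2 ns.
print(pipelined_time_ns(500, 5, 2))
print(round(pipeline_speedup(500, 5, 10, 2), 2))
```

Under these assumptions the speedup approaches the stage count only as the instruction count grows, because the fill cycles are amortized over more instructions.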

Pipelined datapath
  [Figure, repeated over several slides: the pipelined datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers and the control signals ALUSrc, ALUop, MemtoReg, MemRead, RegDst, RegWrite, MemWrite, PCSrc, and Zero, stepping through this instruction sequence]
    add $1, $2, $3
    lw  $4, 0($5)
    sub $6, $7, $8
    sub $9, $10, $11
    sw  $1, 0($12)
  Is this right?

Pipelined datapath
  [Figure: revised datapath in which inst[15:11] is drawn separately from the other register file inputs]

Pipelined datapath + control
  [Figure: the revised datapath plus a Control unit. The control fields are carried in the pipeline registers: WB, ME, and EX in ID/EX; WB and ME in EX/MEM; WB in MEM/WB. RegWrite is shown returning to the register file]

Simplified pipeline diagram
  1. Use symbols to represent the physical resources, with the abbreviations for the pipeline stages
     1. IF, ID, EXE, MEM, WB
  2. The horizontal axis represents the timeline; the vertical axis represents the instruction stream
  3. Example:

    add $1, $2, $3    IF  ID  EXE MEM WB
    lw  $4, 0($5)         IF  ID  EXE MEM WB
    sub $6, $7, $8            IF  ID  EXE MEM WB
    sub $9, $10, $11              IF  ID  EXE MEM WB
    sw  $1, 0($12)                    IF  ID  EXE MEM WB

How much speedup should pipelining provide, and why?
  Letter  Answer
  A       5x, by Amdahl's law (x = 0.8, S = 6.25)
  B       25x, by the PE, since CPI goes up by 5x and cycle time goes down by 5x
  C       2.24x, by the PE and the quotient rule
  D       5x, by the PE, since cycle time goes down by 90%
  E       5x, by the PE, since clock rate goes up by 5x

Pipelining Inaction
  Module latencies (from the figure):
    Ctrl         0.797 ns
    Imem         2.77 ns
    ArgBMux      1.124 ns
    ALU          6.527 ns
    Dmem         1.744 ns
    WriteRegMux  3.067 ns
    RegFile      2.27 ns

Ideal 5-stage Pipeline (3.733 ns -> 267 MHz)
  [Figure: single-cycle implementation, to scale]
  18.667 ns -> 3.733 ns == 80% reduction in CT
  L_old = IC * CPI * CT_old
  L_new = IC * CPI * CT_new
  CT_new = 0.2 * CT_old
  L_new = 0.2 * L_old
  Speedup = L_old / L_new = 5x

Realistic 5-stage Pipeline
  [Figure: ideal 5-stage pipeline (3.733 ns -> 267 MHz), single-cycle implementation, and realistic 5-stage pipeline, to scale]
  What's the actual speedup? Clock rate?
  Letter  Answer
  A       3x; 150 MHz
  B       1.02x; 76.6 MHz
  C       2.85x; 153 MHz
  D       5.49x; 294 MHz
  E       None of the above
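
The ideal-pipeline arithmetic above (IC and CPI unchanged, so the speedup is just the ratio of cycle times) can be checked with a short Python sketch. The names below are made up for this example. The last part goes one step beyond the slides and is an assumption: it treats the slowest module latency as a lower bound on the pipelined cycle time, which holds only if no module is split across stages, and it ignores pipeline register overhead.

```python
def clock_rate_mhz(cycle_time_ns):
    """Convert a cycle time in nanoseconds to a clock rate in MHz."""
    return 1000.0 / cycle_time_ns


def speedup_from_cycle_time(ct_old_ns, ct_new_ns):
    """L_old / L_new when IC and CPI are the same for both designs."""
    return ct_old_ns / ct_new_ns


# Module latencies in ns, copied from the "Pipelining Inaction" slide.
MODULE_LATENCY_NS = {
    "Ctrl": 0.797,
    "Imem": 2.77,
    "ArgBMux": 1.124,
    "ALU": 6.527,
    "Dmem": 1.744,
    "WriteRegMux": 3.067,
    "RegFile": 2.27,
}

CT_SINGLE_NS = 18.667  # single-cycle implementation
CT_IDEAL_NS = 3.733    # ideal 5-stage pipeline

ideal_speedup = speedup_from_cycle_time(CT_SINGLE_NS, CT_IDEAL_NS)
ideal_clock = clock_rate_mhz(CT_IDEAL_NS)

# Assumption: a stage cannot be faster than the slowest module it contains.
slowest_ns = max(MODULE_LATENCY_NS.values())
bound_speedup = speedup_from_cycle_time(CT_SINGLE_NS, slowest_ns)
bound_clock = clock_rate_mhz(slowest_ns)

print(round(ideal_speedup, 2), round(ideal_clock, 1))
print(round(bound_speedup, 2), round(bound_clock, 1))
```

The gap between the two printed lines is the point of the ideal vs. realistic comparison: evenly balanced stages are what make a 5x speedup possible, and one slow module is enough to lose much of it.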