
COMPUTER ORGANIZATION AND DESIGN, 5th Edition
The Hardware/Software Interface

Chapter 1
Computer Abstractions
and Technology
Classes of Computers
 Personal computers
 General purpose, variety of software
 Subject to cost/performance tradeoff

 Server computers
 Network based
 High capacity, performance, reliability
 Range from small servers to building sized

Chapter 1 — Computer Abstractions and Technology — 2


Classes of Computers
 Supercomputers
 High-end scientific and engineering
calculations
 Highest capability but represent a small
fraction of the overall computer market

 Embedded computers
 Hidden as components of systems
 Stringent power/performance/cost constraints

Chapter 1 — Computer Abstractions and Technology — 3


What You Will Learn
 How programs are translated into the
machine language
 And how the hardware executes them
 The hardware/software interface
 What determines program performance
 And how it can be improved
 How hardware designers improve
performance
 What is parallel processing
Chapter 1 — Computer Abstractions and Technology — 4
Understanding Performance
 Algorithm
 Determines number of operations executed
 Programming language, compiler, architecture
 Determine number of machine instructions executed
per operation
 Processor and memory system
 Determine how fast instructions are executed
 I/O system (including OS)
 Determines how fast I/O operations are executed

Chapter 1 — Computer Abstractions and Technology — 5


§1.2 Eight Great Ideas in Computer Architecture
Eight Great Ideas
 Design for Moore’s Law

 Use abstraction to simplify design

 Make the common case fast

 Performance via parallelism

 Performance via pipelining

 Performance via prediction

 Hierarchy of memories

 Dependability via redundancy

Chapter 1 — Computer Abstractions and Technology — 6


§1.3 Below Your Program
Below Your Program
 Application software
 Written in high-level language
 System software
 Compiler: translates HLL code to
machine code
 Operating System: service code
 Handling input/output
 Managing memory and storage
 Scheduling tasks & sharing resources
 Hardware
 Processor, memory, I/O controllers

Chapter 1 — Computer Abstractions and Technology — 7


Levels of Program Code
 High-level language
 Level of abstraction closer
to problem domain
 Provides for productivity
and portability
 Assembly language
 Textual representation of
instructions
 Hardware representation
 Binary digits (bits)
 Encoded instructions and
data

Chapter 1 — Computer Abstractions and Technology — 8


§1.4 Under the Covers
Components of a Computer
The BIG Picture  Same components for
all kinds of computer
 Desktop, server,
embedded
 Input/output includes
 User-interface devices
 Display, keyboard, mouse
 Storage devices
 Hard disk, CD/DVD, flash
 Network adapters
 For communicating with
other computers

Chapter 1 — Computer Abstractions and Technology — 9


Inside the Processor (CPU)
 Datapath: performs operations on data
 Control: sequences datapath, memory, ...
 Cache memory
 Small fast SRAM memory for immediate
access to data

Chapter 1 — Computer Abstractions and Technology — 10


Inside the Processor
 Apple A5 SoC

Chapter 1 — Computer Abstractions and Technology — 11


A Safe Place for Data
 Volatile main memory
 Loses instructions and data when power off
 Non-volatile secondary memory
 Magnetic disk
 Flash memory
 Optical disk (CDROM, DVD)

Chapter 1 — Computer Abstractions and Technology — 12


§1.5 Technologies for Building Processors and Memory
Technology Trends
 Electronics
technology
continues to evolve
 Increased capacity
and performance
 Reduced cost
(Figure: DRAM capacity growth over time)

Year Technology Relative performance/cost


1951 Vacuum tube 1
1965 Transistor 35
1975 Integrated circuit (IC) 900
1995 Very large scale IC (VLSI) 2,400,000
2013 Ultra large scale IC 250,000,000,000

Chapter 1 — Computer Abstractions and Technology — 13


Manufacturing ICs

 Yield: proportion of working dies per wafer

Chapter 1 — Computer Abstractions and Technology — 14


Intel Core i7 Wafer

 300mm wafer, 280 chips, 32nm technology


 Each chip is 20.7 x 10.5 mm
Chapter 1 — Computer Abstractions and Technology — 15
Integrated Circuit Cost

 Cost per die = Cost per wafer / (Dies per wafer × Yield)
 Dies per wafer ≈ Wafer area / Die area
 Yield = 1 / (1 + (Defects per area × Die area / 2))²

 Nonlinear relation to area and defect rate
 Wafer cost and area are fixed
 Defect rate determined by manufacturing process
 Die area determined by architecture and circuit design

Chapter 1 — Computer Abstractions and Technology — 16
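A minimal C sketch of these cost relations; the formulas follow the slide, while the wafer cost, die area, and defect rate are assumed values chosen only for illustration:

```c
#include <math.h>
#include <stdio.h>

/* Cost model from the slide:
   Dies per wafer ~ Wafer area / Die area
   Yield = 1 / (1 + Defects per area * Die area / 2)^2
   Cost per die = Cost per wafer / (Dies per wafer * Yield) */
int main(void) {
    double wafer_cost  = 5000.0;                 /* assumed cost per wafer ($) */
    double wafer_area  = 3.14159 * 15.0 * 15.0;  /* 300 mm wafer -> 15 cm radius, area in cm^2 */
    double die_area    = 2.17;                   /* assumed die area in cm^2 (about 20.7 x 10.5 mm) */
    double defect_rate = 0.03;                   /* assumed defects per cm^2 */

    double dies_per_wafer = wafer_area / die_area;
    double yield = 1.0 / pow(1.0 + defect_rate * die_area / 2.0, 2.0);
    double cost_per_die = wafer_cost / (dies_per_wafer * yield);

    printf("dies/wafer = %.0f, yield = %.2f, cost/die = $%.2f\n",
           dies_per_wafer, yield, cost_per_die);
    return 0;
}
```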


§1.6 Performance
Defining Performance
 Which airplane has the best performance?

(Charts compare the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 on four metrics: passenger capacity, cruising range (miles), cruising speed (mph), and passengers × mph.)

Chapter 1 — Computer Abstractions and Technology — 17


Response Time and Throughput
 Response time
 How long it takes to do a task
 Throughput
 Total work done per unit time
 e.g., tasks/transactions/… per hour
 How are response time and throughput affected
by
 Replacing the processor with a faster version?
 Adding more processors?
 We’ll focus on response time for now…

Chapter 1 — Computer Abstractions and Technology — 18


Relative Performance
 Define Performance = 1/Execution Time
 “X is n times faster than Y”
 Performance_X / Performance_Y
 = Execution time_Y / Execution time_X = n

 Example: time taken to run a program


 10s on A, 15s on B
 Execution TimeB / Execution TimeA
= 15s / 10s = 1.5
 So A is 1.5 times faster than B
Chapter 1 — Computer Abstractions and Technology — 19
Measuring Execution Time
 Elapsed time
 Total response time, including all aspects
 Processing, I/O, OS overhead, idle time
 Determines system performance
 CPU time
 Time spent processing a given job
 Discounts I/O time, other jobs’ shares
 Comprises user CPU time and system CPU
time
 Different programs are affected differently by
CPU and system performance
Chapter 1 — Computer Abstractions and Technology — 20
CPU Clocking
 Operation of digital hardware governed by a
constant-rate clock
(Figure: clock signal showing the clock period; data transfer and computation happen within a cycle, and state updates on the clock edge)
 Clock period: duration of a clock cycle


 e.g., 250 ps = 0.25 ns = 250×10^–12 s
 Clock frequency (rate): cycles per second
 e.g., 4.0 GHz = 4000 MHz = 4.0×10^9 Hz
Chapter 1 — Computer Abstractions and Technology — 21
CPU Time
CPU Time = CPU Clock Cycles × Clock Cycle Time
         = CPU Clock Cycles / Clock Rate
 Performance improved by
 Reducing number of clock cycles
 Increasing clock rate
 Hardware designer must often trade off clock
rate against cycle count

Chapter 1 — Computer Abstractions and Technology — 22


CPU Time Example
 Computer A: 2GHz clock, 10s CPU time
 Designing Computer B
 Aim for 6s CPU time
 Can do faster clock, but causes 1.2 × clock cycles
 How fast must Computer B clock be?
Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20×10^9

Clock Rate_B = (1.2 × 20×10^9) / 6s = (24×10^9) / 6s = 4GHz
Chapter 1 — Computer Abstractions and Technology — 23
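A quick C check of the same calculation (a sketch; the numbers come straight from the slide):

```c
#include <stdio.h>

int main(void) {
    double clock_rate_a = 2e9;     /* Computer A: 2 GHz */
    double cpu_time_a   = 10.0;    /* seconds */
    double cpu_time_b   = 6.0;     /* target CPU time for Computer B */

    double cycles_a = cpu_time_a * clock_rate_a;   /* 20e9 cycles */
    double cycles_b = 1.2 * cycles_a;              /* the faster clock costs 1.2x cycles */
    double clock_rate_b = cycles_b / cpu_time_b;   /* required clock rate */

    printf("Computer B needs %.1f GHz\n", clock_rate_b / 1e9);  /* prints 4.0 GHz */
    return 0;
}
```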
Instruction Count and CPI
Clock Cycles = Instruction Count × Cycles per Instruction
CPU Time = Instruction Count × CPI × Clock Cycle Time
         = Instruction Count × CPI / Clock Rate
 Instruction Count for a program
 Determined by program, ISA and compiler
 Average cycles per instruction
 Determined by CPU hardware
 If different instructions have different CPI
 Average CPI affected by instruction mix

Chapter 1 — Computer Abstractions and Technology — 24


CPI Example
 Computer A: Cycle Time = 250ps, CPI = 2.0
 Computer B: Cycle Time = 500ps, CPI = 1.2
 Same ISA
 Which is faster, and by how much?
CPU Time_A = Instruction Count × CPI_A × Cycle Time_A
           = I × 2.0 × 250ps = I × 500ps        (A is faster…)
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B
           = I × 1.2 × 500ps = I × 600ps

CPU Time_B / CPU Time_A = (I × 600ps) / (I × 500ps) = 1.2   (…by this much)
Chapter 1 — Computer Abstractions and Technology — 25
CPI in More Detail
 If different instruction classes take different
numbers of cycles
Clock Cycles = Σ (i = 1 to n) (CPI_i × Instruction Count_i)

 Weighted average CPI

CPI = Clock Cycles / Instruction Count
    = Σ (i = 1 to n) ( CPI_i × (Instruction Count_i / Instruction Count) )

where (Instruction Count_i / Instruction Count) is the relative frequency

Chapter 1 — Computer Abstractions and Technology — 26


CPI Example
 Alternative compiled code sequences using
instructions in classes A, B, C

Class A B C
CPI for class 1 2 3
IC in sequence 1 2 1 2
IC in sequence 2 4 1 1

 Sequence 1: IC = 5
   Clock Cycles = 2×1 + 1×2 + 2×3 = 10
   Avg. CPI = 10/5 = 2.0
 Sequence 2: IC = 6
   Clock Cycles = 4×1 + 1×2 + 1×3 = 9
   Avg. CPI = 9/6 = 1.5
Chapter 1 — Computer Abstractions and Technology — 27
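The same arithmetic as a small C sketch (values taken from the table above):

```c
#include <stdio.h>

int main(void) {
    int cpi[3]  = {1, 2, 3};      /* CPI for classes A, B, C */
    int seq1[3] = {2, 1, 2};      /* instruction counts, sequence 1 */
    int seq2[3] = {4, 1, 1};      /* instruction counts, sequence 2 */
    int *seqs[2] = {seq1, seq2};

    for (int s = 0; s < 2; s++) {
        int cycles = 0, ic = 0;
        for (int c = 0; c < 3; c++) {
            cycles += cpi[c] * seqs[s][c];   /* Clock Cycles = sum(CPI_i * IC_i) */
            ic     += seqs[s][c];
        }
        printf("Sequence %d: IC = %d, cycles = %d, avg CPI = %.1f\n",
               s + 1, ic, cycles, (double)cycles / ic);
    }
    return 0;
}
```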
Performance Summary
The BIG Picture

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

 Performance depends on
 Algorithm: affects IC, possibly CPI
 Programming language: affects IC, CPI
 Compiler: affects IC, CPI
 Instruction set architecture: affects IC, CPI, Tc

Chapter 1 — Computer Abstractions and Technology — 28


Multiprocessors
 Multicore microprocessors
 More than one processor per chip
 Requires explicitly parallel programming
 Compare with instruction level parallelism
 Hardware executes multiple instructions at once
 Hidden from the programmer
 Hard to do
 Programming for performance
 Load balancing
 Optimizing communication and synchronization

Chapter 1 — Computer Abstractions and Technology — 29


SPEC CPU Benchmark
 Programs used to measure performance
 Supposedly typical of actual workload
 Standard Performance Evaluation Corp (SPEC)
 Develops benchmarks for CPU, I/O, Web, …
 SPEC CPU2006
 Elapsed time to execute a selection of programs
 Negligible I/O, so focuses on CPU performance
 Normalize relative to reference machine
 Summarize as geometric mean of performance ratios
 CINT2006 (integer) and CFP2006 (floating-point)

Geometric mean = n-th root of ( Π (i = 1 to n) Execution time ratio_i )

Chapter 1 — Computer Abstractions and Technology — 30
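A sketch of the geometric-mean summary in C; the ratio values below are invented for illustration, since SPEC reports per-benchmark execution-time ratios against the reference machine:

```c
#include <math.h>
#include <stdio.h>

/* Geometric mean = n-th root of the product of the n performance ratios */
double geometric_mean(const double *ratios, int n) {
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(ratios[i]);   /* sum of logs avoids overflow of the product */
    return exp(log_sum / n);
}

int main(void) {
    double ratios[] = {12.3, 9.8, 15.1, 7.4};   /* hypothetical per-benchmark ratios */
    int n = sizeof(ratios) / sizeof(ratios[0]);
    printf("SPECratio (geometric mean) = %.2f\n", geometric_mean(ratios, n));
    return 0;
}
```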


§1.10 Fallacies and Pitfalls
Pitfall: Amdahl’s Law
 Improving an aspect of a computer and
expecting a proportional improvement in
overall performance
T_improved = T_affected / improvement factor + T_unaffected

 Example: multiply accounts for 80s/100s
 How much improvement in multiply performance to get 5× overall?

 20 = 80/n + 20  →  Can’t be done!
 Corollary: make the common case fast
Chapter 1 — Computer Abstractions and Technology — 31
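Amdahl’s Law as a small C function, using the 80 s / 100 s split from the slide (a sketch, not part of the original slides):

```c
#include <stdio.h>

/* T_improved = T_affected / improvement_factor + T_unaffected */
double amdahl(double t_affected, double t_unaffected, double factor) {
    return t_affected / factor + t_unaffected;
}

int main(void) {
    double t_mul = 80.0, t_rest = 20.0;   /* multiply takes 80s of a 100s run */
    for (double n = 2.0; n <= 64.0; n *= 2.0)
        printf("speed up multiply %2.0fx -> total %.1fs\n",
               n, amdahl(t_mul, t_rest, n));
    /* The total approaches but never reaches 20s, so a 5x overall
       improvement (a 20s total) cannot be achieved. */
    return 0;
}
```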
Pitfall: MIPS as a Performance Metric
 MIPS: Millions of Instructions Per Second
 Doesn’t account for
 Differences in ISAs between computers
 Differences in complexity between instructions

MIPS = Instruction count / (Execution time × 10^6)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10^6)
     = Clock rate / (CPI × 10^6)
 CPI varies between programs on a given CPU
Chapter 1 — Computer Abstractions and Technology — 32
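A short C illustration of why MIPS can mislead: two hypothetical programs on the same 4 GHz CPU, where the program with the higher MIPS rating actually takes longer (all numbers are invented):

```c
#include <stdio.h>

int main(void) {
    double clock_rate = 4e9;       /* 4 GHz */

    /* Program 1: many cheap instructions; Program 2: fewer, costlier ones */
    double ic[2]  = {8e9, 2e9};    /* instruction counts (hypothetical) */
    double cpi[2] = {1.0, 2.5};    /* average CPI (hypothetical) */

    for (int p = 0; p < 2; p++) {
        double time = ic[p] * cpi[p] / clock_rate;
        double mips = ic[p] / (time * 1e6);      /* = clock_rate / (CPI * 1e6) */
        printf("Program %d: time = %.2fs, MIPS = %.0f\n", p + 1, time, mips);
    }
    /* Program 1 reports 4000 MIPS but runs 2.00s; Program 2 reports
       1600 MIPS but runs 1.25s, so the higher-MIPS program is slower. */
    return 0;
}
```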
§1.9 Concluding Remarks
Concluding Remarks
 Cost/performance is improving
 Due to underlying technology development
 Hierarchical layers of abstraction
 In both hardware and software
 Instruction set architecture
 The hardware/software interface
 Execution time: the best performance
measure
 Power is a limiting factor
 Use parallelism to improve performance
Chapter 1 — Computer Abstractions and Technology — 33
Chapter 2
Instructions: Language
of the Computer
HW#1:
1.3 all, 1.4 all, 1.6.1, 1.14.4, 1.14.5, 1.14.6, 1.15.1, and 1.15.4
Due date: one week.

Practice:
1.5 all, 1.6 all, 1.10 all, 1.11 all, 1.14 all, and 1.15 all
§2.1 Introduction
Instruction Set
 The repertoire of instructions of a
computer
 Different computers have different
instruction sets
 But with many aspects in common
 Early computers had very simple
instruction sets
 Simplified implementation
 Many modern computers also have simple
instruction sets

Dr. Yahya Tashtoush


Two Key Principles of Machine Design
1. Instructions are represented as numbers and, as such, are indistinguishable from data
2. Programs are stored in alterable memory (that can be read or written to) just like data

 Stored-program concept
 Programs can be shipped as files of binary numbers – binary compatibility
 Computers can inherit ready-made software provided they are compatible with an existing ISA – leads industry to align around a small number of ISAs

(Figure: memory holding an accounting program (machine code), a C compiler (machine code), payroll data, and source code in C for the accounting program)

Dr. Yahya Tashtoush


MIPS-32 ISA
 Instruction Categories
 Computational
 Load/Store
 Jump and Branch
 Floating Point (coprocessor)
 Memory Management
 Special

 Registers: R0 - R31, plus PC, HI, and LO

 3 Instruction Formats: all 32 bits wide

 op rs rt rd sa funct      R format
 op rs rt immediate        I format
 op jump target            J format


Dr. Yahya Tashtoush
MIPS (RISC) Design Principles
 Simplicity favors regularity
 fixed size instructions
 small number of instruction formats
 opcode always the first 6 bits
 Smaller is faster
 limited instruction set
 limited number of registers in register file
 limited number of addressing modes
 Make the common case fast
 arithmetic operands from the register file (load-store
machine)
 allow instructions to contain immediate operands
 Good design demands good compromises
 three instruction formats
Dr. Yahya Tashtoush
Instructions
 MIPS assembly language arithmetic statement
add $t0, $s1, $s2
sub $t0, $s1, $s2

 Each arithmetic instruction performs one operation


 Each specifies exactly three operands that are all
contained in the datapath’s register file ($t0,$s1,$s2)
destination  source1 op source2

 Instruction Format (R format)

0 17 18 8 0 0x22

Dr. Yahya Tashtoush


MIPS Instruction Fields
 MIPS fields are given names to make
them easier to refer to
op rs rt rd shamt funct

op 6-bits opcode that specifies the operation


rs 5-bits register file address of the first source operand
rt 5-bits register file address of the second source operand
rd 5-bits register file address of the result’s destination
shamt 5-bits shift amount (for shift instructions)
funct 6-bits function code augmenting the opcode

Dr. Yahya Tashtoush


MIPS Register File
 Holds thirty-two 32-bit registers
 Two read ports and one write port

(Figure: register file with two 5-bit source-register addresses, a 5-bit destination-register address, 32-bit write data and a write control input, and two 32-bit read-data outputs)

 Registers are
 Faster than main memory
 - But register files with more locations are slower (e.g., a 64-word file could be as much as 50% slower than a 32-word file)
 - Read/write port increase impacts speed quadratically
 Easier for a compiler to use
 - e.g., (A*B) – (C*D) – (E*F) can do multiplies in any order vs. a stack
 Can hold variables so that
 - code density improves (since registers are named with fewer bits than a memory location)

Dr. Yahya Tashtoush


Convention
Name        Register Number  Usage                   Preserve on call?
$zero       0                constant 0 (hardware)   n.a.
$at         1                reserved for assembler  n.a.
$v0 - $v1   2-3              returned values         no
$a0 - $a3   4-7              arguments               yes
$t0 - $t7   8-15             temporaries             no
$s0 - $s7   16-23            saved values            yes
$t8 - $t9   24-25            temporaries             no
$gp         28               global pointer          yes
$sp         29               stack pointer           yes
$fp         30               frame pointer           yes
$ra         31               return addr (hardware)  yes
§2.2 Operations of the Computer Hardware
Arithmetic Operations
 Add and subtract, three operands
 Two sources and one destination
add a, b, c # a gets b + c
 All arithmetic operations have this form
 Design Principle 1: Simplicity favours
regularity
 Regularity makes implementation simpler
 Simplicity enables higher performance at
lower cost

Dr. Yahya Tashtoush


Arithmetic Example
 C code:
f = (g + h) - (i + j);

 Compiled MIPS code:


add t0, g, h # temp t0 = g + h
add t1, i, j # temp t1 = i + j
sub f, t0, t1 # f = t0 - t1

Dr. Yahya Tashtoush


§2.3 Operands of the Computer Hardware
Register Operands
 Arithmetic instructions use register
operands
 MIPS has a 32 × 32-bit register file
 Use for frequently accessed data
 Numbered 0 to 31
 32-bit data called a “word”
 Assembler names
 $t0, $t1, …, $t9 for temporary values
 $s0, $s1, …, $s7 for saved variables
 Design Principle 2: Smaller is faster
 c.f. main memory: millions of locations

Dr. Yahya Tashtoush


Register Operand Example
 C code:
f = (g + h) - (i + j);
 f, …, j in $s0, …, $s4

 Compiled MIPS code:


add $t0, $s1, $s2
add $t1, $s3, $s4
sub $s0, $t0, $t1

Dr. Yahya Tashtoush


Memory Operands
 Main memory used for composite data
 Arrays, structures, dynamic data
 To apply arithmetic operations
 Load values from memory into registers
 Store result from register to memory
 Memory is byte addressed
 Each address identifies an 8-bit byte
 Words are aligned in memory
 Address must be a multiple of 4
 MIPS is Big Endian
 Most-significant byte at least address of a word
 c.f. Little Endian: least-significant byte at least address

Dr. Yahya Tashtoush


Machine Language - Load Instruction
 Load/Store Instruction Format (I format):
 lw $t0, 24($s3)

 Encoding: op = 35, rs = 19 ($s3), rt = 8 ($t0), immediate = 24 (decimal)

 Effective address: offset 24 (… 0001 1000) + $s3 (… 1001 0100 1010 1100 = 0x12004094) = 0x120040ac
 The memory word at 0x120040ac is loaded into $t0

(Figure: byte-addressed memory with word addresses (hex) from 0x00000000 up to 0xffffffff; the data word at 0x120040ac is loaded into $t0)

Dr. Yahya Tashtoush
Byte Addresses
 Since 8-bit bytes are so useful, most architectures
address individual bytes in memory
 Alignment restriction - the memory address of a word must be on natural word boundaries (a multiple of 4 in MIPS-32)
 Big Endian: leftmost byte is word address
 IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA
 Little Endian: rightmost byte is word address
 Intel 80x86, DEC Vax, DEC Alpha (Windows NT)

(Figure: byte numbering within a word - little endian labels the bytes 3 2 1 0 from msb to lsb, so byte 0 is the rightmost byte; big endian labels them 0 1 2 3, so byte 0 is the leftmost byte)
Dr. Yahya Tashtoush
Memory Operand Example 1
 C code:
g = h + A[8];
 g in $s1, h in $s2, base address of A in $s3

 Compiled MIPS code:


 Index 8 requires offset of 32
 4 bytes per word
lw $t0, 32($s3) # load word (offset 32, base register $s3)
add $s1, $s2, $t0

Dr. Yahya Tashtoush


Memory Operand Example 2
 C code:
A[12] = h + A[8];
 h in $s2, base address of A in $s3

 Compiled MIPS code:


 Index 8 requires offset of 32
lw $t0, 32($s3) # load word
add $t0, $s2, $t0
sw $t0, 48($s3) # store word

Dr. Yahya Tashtoush


Registers vs. Memory
 Registers are faster to access than
memory
 Operating on memory data requires loads
and stores
 More instructions to be executed
 Compiler must use registers for variables
as much as possible
 Only spill to memory for less frequently used
variables
 Register optimization is important!

Dr. Yahya Tashtoush


Immediate Operands
 Constant data specified in an instruction
addi $s3, $s3, 4
 No subtract immediate instruction
 Just use a negative constant
addi $s2, $s1, -1
 Design Principle 3: Make the common
case fast
 Small constants are common
 Immediate operand avoids a load instruction

Dr. Yahya Tashtoush


The Constant Zero
 MIPS register 0 ($zero) is the constant 0
 Cannot be overwritten
 Useful for common operations
 E.g., move between registers
add $t2, $s1, $zero

Dr. Yahya Tashtoush


Review: Unsigned Binary Representation

Hex         Binary   Decimal
0x00000000  0…0000   0
0x00000001  0…0001   1
0x00000002  0…0010   2
0x00000003  0…0011   3
0x00000004  0…0100   4
0x00000005  0…0101   5
0x00000006  0…0110   6
0x00000007  0…0111   7
0x00000008  0…1000   8
0x00000009  0…1001   9
…
0xFFFFFFFC  1…1100   2^32 - 4
0xFFFFFFFD  1…1101   2^32 - 3
0xFFFFFFFE  1…1110   2^32 - 2
0xFFFFFFFF  1…1111   2^32 - 1

Bit weights are 2^31 2^30 2^29 … 2^3 2^2 2^1 2^0 at bit positions 31 30 29 … 3 2 1 0; a word of all 1 bits equals 2^32 - 1 (a 1 followed by 32 zeros, minus 1)
Review: Signed Binary Representation
2’s complement binary   decimal
 -2^3       = 1000      -8
 -(2^3 - 1) = 1001      -7
              1010      -6
              1011      -5
              1100      -4
              1101      -3
              1110      -2
              1111      -1
              0000       0
              0001       1
              0010       2
              0011       3
              0100       4
              0101       5
              0110       6
 2^3 - 1    = 0111       7

To negate, complement all the bits and add 1 (in either direction), e.g., 0101 (+5) → 1010 + 1 = 1011 (-5)
2s-Complement Signed Integers
 Given an n-bit number
n1 n2
x  xn12  x n 2 2    x12  x0 2
1 0

 Range: –2n – 1 to +2n – 1 – 1


 Example
 1111 1111 1111 1111 1111 1111 1111 11002
= –1×231 + 1×230 + … + 1×22 +0×21 +0×20
= –2,147,483,648 + 2,147,483,644 = –410
 Using 32 bits
 –2,147,483,648 to +2,147,483,647

Dr. Yahya Tashtoush


2s-Complement Signed Integers
 Bit 31 is sign bit
 1 for negative numbers
 0 for non-negative numbers
 –(–2^(n–1)) can’t be represented
 Non-negative numbers have the same unsigned
and 2s-complement representation
 Some specific numbers
 0: 0000 0000 … 0000
 –1: 1111 1111 … 1111
 Most-negative: 1000 0000 … 0000
 Most-positive: 0111 1111 … 1111

Dr. Yahya Tashtoush


Signed Negation
 Complement and add 1
 Complement means 1 → 0, 0 → 1

x  x  1111...1112  1

x  1  x

 Example: negate +2
 +2 = 0000 0000 … 00102
 –2 = 1111 1111 … 11012 + 1
= 1111 1111 … 11102

Dr. Yahya Tashtoush


Sign Extension
 Representing a number using more bits
 Preserve the numeric value
 In MIPS instruction set
 addi: extend immediate value
 lb, lh: extend loaded byte/halfword
 beq, bne: extend the displacement
 Replicate the sign bit to the left
 c.f. unsigned values: extend with 0s
 Examples: 8-bit to 16-bit
 +2: 0000 0010 => 0000 0000 0000 0010
 –2: 1111 1110 => 1111 1111 1111 1110

Dr. Yahya Tashtoush
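A small C sketch of the two 8-bit to 16-bit examples above: sign extension replicates the sign bit, while unsigned (zero) extension fills with 0s:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int8_t pos = 2, neg = -2;

    /* Sign extension: the cast replicates bit 7 into bits 15..8 */
    int16_t pos_ext = (int16_t)pos;   /* 0000 0010 -> 0000 0000 0000 0010 */
    int16_t neg_ext = (int16_t)neg;   /* 1111 1110 -> 1111 1111 1111 1110 */

    /* Zero extension: an unsigned source just fills the new bits with 0 */
    uint16_t zext = (uint16_t)(uint8_t)neg;  /* 1111 1110 -> 0000 0000 1111 1110 */

    printf("+2 sign-extended:   0x%04X\n", (uint16_t)pos_ext);  /* 0x0002 */
    printf("-2 sign-extended:   0x%04X\n", (uint16_t)neg_ext);  /* 0xFFFE */
    printf("0xFE zero-extended: 0x%04X\n", zext);               /* 0x00FE */
    return 0;
}
```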


§2.5 Representing Instructions in the Computer
Representing Instructions
 Instructions are encoded in binary
 Called machine code
 MIPS instructions
 Encoded as 32-bit instruction words
 Small number of formats encoding operation code
(opcode), register numbers, …
 Regularity!
 Register numbers
 $t0 – $t7 are reg’s 8 – 15
 $t8 – $t9 are reg’s 24 – 25
 $s0 – $s7 are reg’s 16 – 23

Dr. Yahya Tashtoush


MIPS R-format Instructions
op rs rt rd shamt funct
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits

 Instruction fields
 op: operation code (opcode)
 rs: first source register number
 rt: second source register number
 rd: destination register number
 shamt: shift amount (00000 for now)
 funct: function code (extends opcode)

Dr. Yahya Tashtoush


R-format Example
op rs rt rd shamt funct
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits

add $t0, $s1, $s2


special $s1 $s2 $t0 0 add

0 17 18 8 0 32

000000 10001 10010 01000 00000 100000

0000 0010 0011 0010 0100 0000 0010 0000 (binary) = 0x02324020

Dr. Yahya Tashtoush
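The same encoding done in C: packing the six R-format fields of add $t0, $s1, $s2 into one 32-bit word (a sketch; the field values come from the slide):

```c
#include <stdio.h>
#include <stdint.h>

/* Pack R-format fields: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6) */
uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt,
                  uint32_t rd, uint32_t shamt, uint32_t funct) {
    return (op << 26) | (rs << 21) | (rt << 16)
         | (rd << 11) | (shamt << 6) | funct;
}

int main(void) {
    /* add $t0, $s1, $s2 -> op=0, rs=17 ($s1), rt=18 ($s2), rd=8 ($t0), shamt=0, funct=32 */
    uint32_t word = encode_r(0, 17, 18, 8, 0, 32);
    printf("0x%08X\n", word);   /* prints 0x02324020 */
    return 0;
}
```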


Hexadecimal
 Base 16
 Compact representation of bit strings
 4 bits per hex digit

0 0000 4 0100 8 1000 c 1100


1 0001 5 0101 9 1001 d 1101
2 0010 6 0110 a 1010 e 1110
3 0011 7 0111 b 1011 f 1111

 Example: eca8 6420


 1110 1100 1010 1000 0110 0100 0010 0000

Dr. Yahya Tashtoush


MIPS I-format Instructions
op rs rt constant or address
6 bits 5 bits 5 bits 16 bits

 Immediate arithmetic and load/store instructions


 rt: destination or source register number
 Constant: –2^15 to +2^15 – 1
 Address: offset added to base address in rs
 Design Principle 4: Good design demands good
compromises
 Different formats complicate decoding, but allow 32-bit
instructions uniformly
 Keep formats as similar as possible

Dr. Yahya Tashtoush


Stored Program Computers
The BIG Picture  Instructions represented in
binary, just like data
 Instructions and data stored
in memory
 Programs can operate on
programs
 e.g., compilers, linkers, …
 Binary compatibility allows
compiled programs to work
on different computers
 Standardized ISAs

Dr. Yahya Tashtoush


§2.6 Logical Operations
Logical Operations
 Instructions for bitwise manipulation
Operation C Java MIPS
Shift left << << sll
Shift right >> >>> srl
Bitwise AND & & and, andi
Bitwise OR | | or, ori
Bitwise NOT ~ ~ nor

 Useful for extracting and inserting


groups of bits in a word
Dr. Yahya Tashtoush
Shift Operations
op rs rt rd shamt funct
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits

 shamt: how many positions to shift


 Shift left logical
 Shift left and fill with 0 bits
 sll by i bits multiplies by 2^i
 Shift right logical
 Shift right and fill with 0 bits
 srl by i bits divides by 2^i (unsigned only)

Dr. Yahya Tashtoush


AND Operations
 Useful to mask bits in a word
 Select some bits, clear others to 0
and $t0, $t1, $t2

$t2 0000 0000 0000 0000 0000 1101 1100 0000

$t1 0000 0000 0000 0000 0011 1100 0000 0000

$t0 0000 0000 0000 0000 0000 1100 0000 0000

Dr. Yahya Tashtoush


OR Operations
 Useful to include bits in a word
 Set some bits to 1, leave others unchanged
or $t0, $t1, $t2

$t2 0000 0000 0000 0000 0000 1101 1100 0000

$t1 0000 0000 0000 0000 0011 1100 0000 0000

$t0 0000 0000 0000 0000 0011 1101 1100 0000

Dr. Yahya Tashtoush


NOT Operations
 Useful to invert bits in a word
 Change 0 to 1, and 1 to 0
 MIPS has NOR 3-operand instruction
 a NOR b == NOT ( a OR b )
nor $t0, $t1, $zero   # register 0 always reads as zero

$t1 0000 0000 0000 0000 0011 1100 0000 0000

$t0 1111 1111 1111 1111 1100 0011 1111 1111

Dr. Yahya Tashtoush
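A C sketch of the three bit-level examples above, using the same operand values; MIPS builds NOT from NOR with $zero, and since C has no NOR operator the complement of (a | 0) is used instead:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t t1 = 0x00003C00;   /* 0000 ... 0011 1100 0000 0000 */
    uint32_t t2 = 0x00000DC0;   /* 0000 ... 0000 1101 1100 0000 */

    uint32_t and_r = t1 & t2;   /* mask bits:    0x00000C00 */
    uint32_t or_r  = t1 | t2;   /* include bits: 0x00003DC0 */
    uint32_t nor_r = ~(t1 | 0); /* NOT via NOR with zero: 0xFFFFC3FF */

    printf("and: 0x%08X\nor:  0x%08X\nnor: 0x%08X\n", and_r, or_r, nor_r);
    return 0;
}
```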


§2.7 Instructions for Making Decisions
Conditional Operations
 Branch to a labeled instruction if a
condition is true
 Otherwise, continue sequentially
 beq rs, rt, L1
 if (rs == rt) branch to instruction labeled L1;
 bne rs, rt, L1
 if (rs != rt) branch to instruction labeled L1;
 j L1
 unconditional jump to instruction labeled L1

Dr. Yahya Tashtoush


Specifying Branch Destinations
 Use a register (like in lw and sw) added to the 16-bit offset
 which register? Instruction Address Register (the PC)
 its use is automatically implied by instruction
 PC gets updated (PC+4) during the fetch cycle so that it holds the
address of the next instruction
 limits the branch distance to –2^15 to +2^15 – 1 (word) instructions from
the (instruction after the) branch instruction, but most branches are
local anyway

(Figure: the 16-bit offset from the low-order bits of the branch instruction is sign-extended, shifted left 2 bits (appending 00), and added to PC+4 to form the branch destination address)

Dr. Yahya Tashtoush


Compiling If Statements
 C code:
if (i==j) f = g+h;
else f = g-h;
 f, g, … in $s0, $s1, …
 Compiled MIPS code:
bne $s3, $s4, Else
add $s0, $s1, $s2
j Exit
Else: sub $s0, $s1, $s2
Exit: …
Assembler calculates addresses

Dr. Yahya Tashtoush


Compiling Loop Statements
 C code:
while (save[i] == k) i += 1;
 i in $s3, k in $s5, address of save in $s6
 Compiled MIPS code:
Loop: sll $t1, $s3, 2
add $t1, $t1, $s6
lw $t0, 0($t1)
bne $t0, $s5, Exit
addi $s3, $s3, 1
j Loop
Exit: …

Dr. Yahya Tashtoush


More Conditional Operations
 Set result to 1 if a condition is true
 Otherwise, set to 0
 slt rd, rs, rt
 if (rs < rt) rd = 1; else rd = 0;
 slti rt, rs, constant
 if (rs < constant) rt = 1; else rt = 0;
 Use in combination with beq, bne
slt $t0, $s1, $s2 # if ($s1 < $s2)
bne $t0, $zero, L # branch to L

Dr. Yahya Tashtoush


Branch Instruction Design
 Why not blt, bge, etc?
 Hardware for <, ≥, … slower than =, ≠
 Combining with branch involves more work
per instruction, requiring a slower clock
 All instructions penalized!
 beq and bne are the common case
 This is a good design compromise

Dr. Yahya Tashtoush


Signed vs. Unsigned
 Signed comparison: slt, slti
 Unsigned comparison: sltu, sltui
 Example
 $s0 = 1111 1111 1111 1111 1111 1111 1111 1111
 $s1 = 0000 0000 0000 0000 0000 0000 0000 0001
 slt $t0, $s0, $s1 # signed
 –1 < +1  $t0 = 1
 sltu $t0, $s0, $s1 # unsigned
 +4,294,967,295 > +1  $t0 = 0

Dr. Yahya Tashtoush


Register Usage
 $a0 – $a3: arguments (reg’s 4 – 7)
 $v0, $v1: result values (reg’s 2 and 3)
 $t0 – $t9: temporaries
 Can be overwritten by callee
 $s0 – $s7: saved
 Must be saved/restored by callee
 $gp: global pointer for static data (reg 28)
 $sp: stack pointer (reg 29)
 $fp: frame pointer (reg 30)
 $ra: return address (reg 31)

Dr. Yahya Tashtoush


Procedure Call Instructions
 Procedure call: jump and link
jal ProcedureLabel
 Address of following instruction put in $ra

 Jumps to target address

 Procedure return: jump register


jr $ra
 Copies $ra to program counter

 Can also be used for computed jumps

 e.g., for case/switch statements

Dr. Yahya Tashtoush


§2.10 MIPS Addressing for 32-Bit Immediates and Addresses
32-bit Constants
 Most constants are small
 16-bit immediate is sufficient
 For the occasional 32-bit constant
lui rt, constant
 Copies 16-bit constant to left 16 bits of rt
 Clears right 16 bits of rt to 0

lui $s0, 61        0000 0000 0011 1101 0000 0000 0000 0000

ori $s0, $s0, 2304 0000 0000 0011 1101 0000 1001 0000 0000

Dr. Yahya Tashtoush


Branch Addressing
 Branch instructions specify
 Opcode, two registers, target address
 Most branch targets are near branch
 Forward or backward

op rs rt constant or address
6 bits 5 bits 5 bits 16 bits

 PC-relative addressing
 Target address = PC + offset × 4
 PC already incremented by 4 by this time
Dr. Yahya Tashtoush
Jump Control Flow Instructions
 MIPS also has an unconditional branch instruction or
jump instruction:
j label #go to label

 Instruction Format (J Format):


0x02 26-bit address

(Figure: the 26-bit target from the low-order bits of the jump instruction is shifted left 2 bits (appending 00) and combined with the upper bits of PC+4 to form the 32-bit jump address)

Dr. Yahya Tashtoush


Target Addressing Example
 Loop code from earlier example
 Assume Loop at location 80000

Loop: sll  $t1, $s3, 2      80000   0   0  19   9   4   0
      add  $t1, $t1, $s6    80004   0   9  22   9   0  32
      lw   $t0, 0($t1)      80008  35   9   8       0
      bne  $t0, $s5, Exit   80012   5   8  21       2
      addi $s3, $s3, 1      80016   8  19  19       1
      j    Loop             80020   2           20000
Exit: …                     80024

Dr. Yahya Tashtoush
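A C check of the two target-address fields in this example: the bne offset is counted in words from the instruction after the branch, and the j field holds the word address of Loop (a sketch using the addresses from the slide):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t bne_addr  = 80012;   /* address of the bne       */
    uint32_t exit_addr = 80024;   /* address of the Exit label */
    uint32_t loop_addr = 80000;   /* address of the Loop label */

    /* PC-relative: offset in words from the instruction after the branch */
    int32_t bne_offset = (int32_t)(exit_addr - (bne_addr + 4)) / 4;
    printf("bne offset field = %d\n", bne_offset);   /* 2, as in the slide */

    /* Pseudodirect: jump field holds the word address of the target */
    uint32_t j_field = loop_addr / 4;
    printf("j target field   = %u\n", j_field);      /* 20000, as in the slide */
    return 0;
}
```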


Branching Far Away
 If branch target is too far to encode with
16-bit offset, assembler rewrites the code
 Example
beq $s0,$s1, L1

bne $s0,$s1, L2
j L1
L2: …

Dr. Yahya Tashtoush


Addressing Mode Summary

Dr. Yahya Tashtoush


§2.12 Translating and Starting a Program
Translation and Startup

(Figure: translation hierarchy from C source through compiler, assembler, linker, and loader; many compilers produce object modules directly, and static linking combines object modules before execution)

Dr. Yahya Tashtoush


Assembler Pseudoinstructions
 Most assembler instructions represent
machine instructions one-to-one
 Pseudoinstructions: figments of the
assembler’s imagination
move $t0, $t1 → add $t0, $zero, $t1
blt $t0, $t1, L → slt $at, $t0, $t1
bne $at, $zero, L

Dr. Yahya Tashtoush


§2.13 A C Sort Example to Put It All Together
C Sort Example
 Illustrates use of assembly instructions
for a C bubble sort function
 Swap procedure (leaf)
void swap(int v[], int k)
{
int temp;
temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;
}
 v in $a0, k in $a1, temp in $t0

Dr. Yahya Tashtoush


The Procedure Swap
swap: sll $t1, $a1, 2 # $t1 = k * 4
add $t1, $a0, $t1 # $t1 = v+(k*4)
# (address of v[k])
lw $t0, 0($t1) # $t0 (temp) = v[k]
lw $t2, 4($t1) # $t2 = v[k+1]
sw $t2, 0($t1) # v[k] = $t2 (v[k+1])
sw $t0, 4($t1) # v[k+1] = $t0 (temp)
jr $ra # return to calling routine

Dr. Yahya Tashtoush


§2.17 Real Stuff: x86 Instructions
The Intel x86 ISA
 Evolution with backward compatibility
 8080 (1974): 8-bit microprocessor
 Accumulator, plus 3 index-register pairs
 8086 (1978): 16-bit extension to 8080
 Complex instruction set (CISC)
 8087 (1980): floating-point coprocessor
 Adds FP instructions and register stack
 80286 (1982): 24-bit addresses, MMU
 Segmented memory mapping and protection
 80386 (1985): 32-bit extension (now IA-32)
 Additional addressing modes and operations
 Paged memory mapping as well as segments

Dr. Yahya Tashtoush


The Intel x86 ISA
 Further evolution…
 i486 (1989): pipelined, on-chip caches and FPU
 Compatible competitors: AMD, Cyrix, …
 Pentium (1993): superscalar, 64-bit datapath
 Later versions added MMX (Multi-Media eXtension)
instructions
 The infamous FDIV bug
 Pentium Pro (1995), Pentium II (1997)
 New microarchitecture (see Colwell, The Pentium Chronicles)
 Pentium III (1999)
 Added SSE (Streaming SIMD Extensions) and associated
registers
 Pentium 4 (2001)
 New microarchitecture
 Added SSE2 instructions

Dr. Yahya Tashtoush


The Intel x86 ISA
 And further…
 AMD64 (2003): extended architecture to 64 bits
 EM64T – Extended Memory 64 Technology (2004)
 AMD64 adopted by Intel (with refinements)
 Added SSE3 instructions
 Intel Core (2006)
 Added SSE4 instructions, virtual machine support
 AMD64 (announced 2007): SSE5 instructions
 Intel declined to follow, instead…
 Advanced Vector Extension (announced 2008)
 Longer SSE registers, more instructions
 If Intel didn’t extend with compatibility, its
competitors would!
 Technical elegance ≠ market success

Dr. Yahya Tashtoush


Basic x86 Registers

Dr. Yahya Tashtoush


x86 Instruction Encoding
 Variable length
encoding
 Postfix bytes specify
addressing mode
 Prefix bytes modify
operation
 Operand length,
repetition, locking, …

Dr. Yahya Tashtoush


Pitfalls
 Sequential words are not at sequential
addresses
 Increment by 4, not by 1!

Dr. Yahya Tashtoush


§2.19 Concluding Remarks
Concluding Remarks
 Design principles
1. Simplicity favors regularity
2. Smaller is faster
3. Make the common case fast
4. Good design demands good compromises

Dr. Yahya Tashtoush


Concluding Remarks
 Measure MIPS instruction executions in
benchmark programs
 Consider making the common case fast
 Consider compromises
Instruction class  MIPS examples                      SPEC2006 Int  SPEC2006 FP
Arithmetic         add, sub, addi                     16%           48%
Data transfer      lw, sw, lb, lbu, lh, lhu, sb, lui  35%           36%
Logical            and, or, nor, andi, ori, sll, srl  12%           4%
Cond. branch       beq, bne, slt, slti, sltiu         34%           8%
Jump               j, jr, jal                         2%            0%

Dr. Yahya Tashtoush


MIPS Organization So Far
(Figure: MIPS organization so far. The processor contains a register file (two 5-bit source-register addresses, one 5-bit destination address, 32-bit write data, and two 32-bit read ports over the 32 registers $zero - $ra), a fetch unit that updates PC = PC + 4, and an ALU. It connects to a byte-addressed, big-endian memory of 2^30 32-bit words with read/write data ports; fetch, decode, and execute stages are shown.)

Dr. Yahya Tashtoush


Chapter 4
The Processor
§4.1 Introduction
Introduction
 CPU performance factors
 Instruction count
 Determined by ISA and compiler
 CPI and Cycle time
 Determined by CPU hardware
 We will examine two MIPS implementations
 A simplified version
 A more realistic pipelined version
 Simple subset, shows most aspects
 Memory reference: lw, sw
 Arithmetic/logical: add, sub, and, or, slt
 Control transfer: beq, j

Chapter 4 — The Processor — 2


Instruction Execution
 PC  instruction memory, fetch instruction
 Register numbers  register file, read registers
 Depending on instruction class
 Use ALU to calculate
 Arithmetic result
 Memory address for load/store
 Branch target address
 Access data memory for load/store
 PC  target address or PC + 4

Chapter 4 — The Processor — 3


CPU Overview

Chapter 4 — The Processor — 4


Multiplexers
 Can’t just join
wires together
 Use multiplexers

Chapter 4 — The Processor — 5


Control

Chapter 4 — The Processor — 6


§4.2 Logic Design Conventions
Logic Design Basics
 Information encoded in binary
 Low voltage = 0, High voltage = 1
 One wire per bit
 Multi-bit data encoded on multi-wire buses
 Combinational element
 Operate on data
 Output is a function of input
 State (sequential) elements
 Store information

Chapter 4 — The Processor — 7


Combinational Elements
 AND-gate: Y = A & B
 Adder: Y = A + B
 Multiplexer: Y = S ? I1 : I0
 Arithmetic/Logic Unit: Y = F(A, B)

(Figure: logic symbols for the AND gate, adder, multiplexer, and ALU)
Chapter 4 — The Processor — 8


Sequential Elements
 Register: stores data in a circuit
 Uses a clock signal to determine when to
update the stored value
 Edge-triggered: update when Clk changes
from 0 to 1

(Figure: D flip-flop with data input D, clock input Clk, and output Q; Q takes the value of D on the rising clock edge)

Chapter 4 — The Processor — 9


Sequential Elements
 Register with write control
 Only updates on clock edge when write
control input is 1
 Used when stored value is required later

(Figure: D flip-flop with write control; the stored value updates from D on a clock edge only when Write is 1)

Chapter 4 — The Processor — 10


Clocking Methodology
 Combinational logic transforms data during
clock cycles
 Between clock edges
 Input from state elements, output to state
element
 Longest delay determines clock period

Chapter 4 — The Processor — 11


§4.3 Building a Datapath
Building a Datapath
 Datapath
 Elements that process data and addresses
in the CPU
 Registers, ALUs, mux’s, memories, …
 We will build a MIPS datapath
incrementally
 Refining the overview design

Chapter 4 — The Processor — 12


Instruction Fetch

(Figure: instruction fetch: the PC, a 32-bit register, supplies the address to instruction memory, and an adder increments the PC by 4 to point at the next instruction)

Chapter 4 — The Processor — 13


R-Format Instructions
 Read two register operands
 Perform arithmetic/logical operation
 Write register result

Chapter 4 — The Processor — 14


Load/Store Instructions
 Read register operands
 Calculate address using 16-bit offset
 Use ALU, but sign-extend offset
 Load: Read memory and update register
 Store: Write register value to memory

Chapter 4 — The Processor — 15


Branch Instructions
 Read register operands
 Compare operands
 Use ALU, subtract and check Zero output
 Calculate target address
 Sign-extend displacement
 Shift left 2 places (word displacement)
 Add to PC + 4
 Already calculated by instruction fetch

Chapter 4 — The Processor — 16


Branch Instructions
(Figure: branch datapath: the shift-left-2 unit just re-routes wires, and sign extension replicates the sign-bit wire)

Chapter 4 — The Processor — 17


Composing the Elements
 First-cut data path does an instruction in
one clock cycle
 Each datapath element can only do one
function at a time
 Hence, we need separate instruction and data
memories
 Use multiplexers where alternate data
sources are used for different instructions

Chapter 4 — The Processor — 18


R-Type/Load/Store Datapath

Chapter 4 — The Processor — 19


Full Datapath

Chapter 4 — The Processor — 20


§4.4 A Simple Implementation Scheme
ALU Control
 ALU used for
 Load/Store: F = add
 Branch: F = subtract
 R-type: F depends on funct field
ALU control Function
0000 AND
0001 OR
0010 add
0110 subtract
0111 set-on-less-than
1100 NOR

Chapter 4 — The Processor — 21


ALU Control
 Assume 2-bit ALUOp derived from opcode
 Combinational logic derives ALU control

opcode ALUOp Operation funct ALU function ALU control


lw 00 load word XXXXXX add 0010
sw 00 store word XXXXXX add 0010
beq 01 branch equal XXXXXX subtract 0110
R-type 10 add 100000 add 0010
subtract 100010 subtract 0110
AND 100100 AND 0000
OR 100101 OR 0001
set-on-less-than 101010 set-on-less-than 0111

Chapter 4 — The Processor — 22


The Main Control Unit
 Control signals derived from instruction

R-type      0         rs     rt     rd     shamt  funct
            31:26     25:21  20:16  15:11  10:6   5:0

Load/Store  35 or 43  rs     rt     address
            31:26     25:21  20:16  15:0

Branch      4         rs     rt     address
            31:26     25:21  20:16  15:0

 Bits 31:26 are the opcode; rs is always read; rt is read, except for load;
the destination field is written for R-type and load; the 16-bit address is
sign-extended and added
Chapter 4 — The Processor — 23


Datapath With Control

Chapter 4 — The Processor — 24


R-Type Instruction

Chapter 4 — The Processor — 25


Load Instruction

Chapter 4 — The Processor — 26


Branch-on-Equal Instruction

Chapter 4 — The Processor — 27


Implementing Jumps
Jump 2 address
31:26 25:0

 Jump uses word address


 Update PC with concatenation of
 Top 4 bits of old PC
 26-bit jump address
 00
 Need an extra control signal decoded from
opcode
Chapter 4 — The Processor — 28
Datapath With Jumps Added

Chapter 4 — The Processor — 29


Performance Issues
 Longest delay determines clock period
 Critical path: load instruction
 Instruction memory  register file  ALU 
data memory  register file
 Not feasible to vary period for different
instructions
 Violates design principle
 Making the common case fast
 We will improve performance by pipelining

Chapter 4 — The Processor — 30


§4.5 An Overview of Pipelining
Pipelining Analogy
 Pipelined laundry: overlapping execution
 Parallelism improves performance

 Four loads:
 Speedup = 8/3.5 = 2.3
 Non-stop:
 Speedup = 2n / (0.5n + 1.5) ≈ 4 = number of stages

Chapter 4 — The Processor — 31


MIPS Pipeline
 Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

Chapter 4 — The Processor — 32


Pipeline Performance
 Assume time for stages is
 100ps for register read or write
 200ps for other stages
 Compare pipelined datapath with single-cycle
datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw 200ps 100 ps 200ps 200ps 100 ps 800ps
sw 200ps 100 ps 200ps 200ps 700ps
R-format 200ps 100 ps 200ps 100 ps 600ps
beq 200ps 100 ps 200ps 500ps

Chapter 4 — The Processor — 33


Pipeline Performance
Single-cycle (Tc= 800ps)

Pipelined (Tc= 200ps)

Chapter 4 — The Processor — 34


Pipeline Speedup
 If all stages are balanced
 i.e., all take the same time
 Time between instructions (pipelined)
= Time between instructions (nonpipelined) / Number of stages
 If not balanced, speedup is less
 Speedup due to increased throughput
 Latency (time for each instruction) does not
decrease

Chapter 4 — The Processor — 35
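Using the 800 ps / 200 ps numbers from the previous slides, a quick C check of throughput versus latency (a sketch; like the slide's steady-state formula it ignores fill and drain cycles):

```c
#include <stdio.h>

int main(void) {
    double single_cycle_ps = 800.0;   /* one instruction per 800 ps, unpipelined */
    double stage_ps        = 200.0;   /* pipelined: limited by the slowest stage */
    int    stages          = 5;

    double ideal = single_cycle_ps / stages;   /* 160 ps if all stages were balanced */
    printf("time between instructions: %.0f ps pipelined vs %.0f ps ideal\n",
           stage_ps, ideal);
    printf("speedup = %.1fx (less than %d because stages are unbalanced)\n",
           single_cycle_ps / stage_ps, stages);
    /* Latency of one instruction is still 5 * 200 ps = 1000 ps; it does not decrease. */
    return 0;
}
```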


Pipelining and ISA Design
 MIPS ISA designed for pipelining
 All instructions are 32-bits
 Easier to fetch and decode in one cycle
 c.f. x86: 1- to 17-byte instructions
 Few and regular instruction formats
 Can decode and read registers in one step
 Load/store addressing
 Can calculate address in 3rd stage, access memory
in 4th stage
 Alignment of memory operands
 Memory access takes only one cycle

Chapter 4 — The Processor — 36


Hazards
 Situations that prevent starting the next
instruction in the next cycle
 Structure hazards
 A required resource is busy
 Data hazard
 Need to wait for previous instruction to
complete its data read/write
 Control hazard
 Deciding on control action depends on
previous instruction

Chapter 4 — The Processor — 37


Structure Hazards
 Conflict for use of a resource
 In MIPS pipeline with a single memory
 Load/store requires data access
 Instruction fetch would have to stall for that
cycle
 Would cause a pipeline “bubble”
 Hence, pipelined datapaths require
separate instruction/data memories
 Or separate instruction/data caches

Chapter 4 — The Processor — 38


Data Hazards
 An instruction depends on completion of
data access by a previous instruction
 add $s0, $t0, $t1
sub $t2, $s0, $t3

Chapter 4 — The Processor — 39


Forwarding (aka Bypassing)
 Use result when it is computed
 Don’t wait for it to be stored in a register
 Requires extra connections in the datapath

Chapter 4 — The Processor — 40


Load-Use Data Hazard
 Can’t always avoid stalls by forwarding
 If value not computed when needed
 Can’t forward backward in time!

Chapter 4 — The Processor — 41


Code Scheduling to Avoid Stalls
 Reorder code to avoid use of load result in
the next instruction
 C code for A = B + E; C = B + F;

 Original order (13 cycles, two stalls):
   lw  $t1, 0($t0)
   lw  $t2, 4($t0)
   (stall) add $t3, $t1, $t2
   sw  $t3, 12($t0)
   lw  $t4, 8($t0)
   (stall) add $t5, $t1, $t4
   sw  $t5, 16($t0)

 Reordered (11 cycles, no stalls):
   lw  $t1, 0($t0)
   lw  $t2, 4($t0)
   lw  $t4, 8($t0)
   add $t3, $t1, $t2
   sw  $t3, 12($t0)
   add $t5, $t1, $t4
   sw  $t5, 16($t0)

Chapter 4 — The Processor — 42


Control Hazards
 Branch determines flow of control
 Fetching next instruction depends on branch
outcome
 Pipeline can’t always fetch correct instruction
 Still working on ID stage of branch
 In MIPS pipeline
 Need to compare registers and compute
target early in the pipeline
 Add hardware to do it in ID stage

Chapter 4 — The Processor — 43


Stall on Branch
 Wait until branch outcome determined
before fetching next instruction

Chapter 4 — The Processor — 44


Branch Prediction
 Longer pipelines can’t readily determine
branch outcome early
 Stall penalty becomes unacceptable
 Predict outcome of branch
 Only stall if prediction is wrong
 In MIPS pipeline
 Can predict branches not taken
 Fetch instruction after branch, with no delay

Chapter 4 — The Processor — 45


MIPS with Predict Not Taken

Prediction
correct

Prediction
incorrect

Chapter 4 — The Processor — 46


More-Realistic Branch Prediction
 Static branch prediction
 Based on typical branch behavior
 Example: loop and if-statement branches
 Predict backward branches taken
 Predict forward branches not taken
 Dynamic branch prediction
 Hardware measures actual branch behavior
 e.g., record recent history of each branch
 Assume future behavior will continue the trend
 When wrong, stall while re-fetching, and update history

Chapter 4 — The Processor — 47


Pipeline Summary
The BIG Picture

 Pipelining improves performance by


increasing instruction throughput
 Executes multiple instructions in parallel
 Each instruction has the same latency
 Subject to hazards
 Structure, data, control
 Instruction set design affects complexity of
pipeline implementation
Chapter 4 — The Processor — 48
§4.6 Pipelined Datapath and Control
MIPS Pipelined Datapath

(Figure: pipelined datapath; the right-to-left flow in the MEM and WB stages leads to hazards)

Chapter 4 — The Processor — 49


Pipeline registers
 Need registers between stages
 To hold information produced in previous cycle

Chapter 4 — The Processor — 50


Pipeline Operation
 Cycle-by-cycle flow of instructions through
the pipelined datapath
 “Single-clock-cycle” pipeline diagram
 Shows pipeline usage in a single cycle
 Highlight resources used
 c.f. “multi-clock-cycle” diagram
 Graph of operation over time
 We’ll look at “single-clock-cycle” diagrams
for load & store

Chapter 4 — The Processor — 51


IF for Load, Store, …

Chapter 4 — The Processor — 52


ID for Load, Store, …

Chapter 4 — The Processor — 53


EX for Load

Chapter 4 — The Processor — 54


MEM for Load

Chapter 4 — The Processor — 55


WB for Load

(Figure: without a fix, the WB stage writes using the wrong register number)

Chapter 4 — The Processor — 56


Corrected Datapath for Load

Chapter 4 — The Processor — 57


EX for Store

Chapter 4 — The Processor — 58


MEM for Store

Chapter 4 — The Processor — 59


WB for Store

Chapter 4 — The Processor — 60


Multi-Cycle Pipeline Diagram
 Form showing resource usage

Chapter 4 — The Processor — 61


Multi-Cycle Pipeline Diagram
 Traditional form

Chapter 4 — The Processor — 62


Single-Cycle Pipeline Diagram
 State of pipeline in a given cycle

Chapter 4 — The Processor — 63


Pipelined Control (Simplified)

Chapter 4 — The Processor — 64


Pipelined Control
 Control signals derived from instruction
 As in single-cycle implementation

Chapter 4 — The Processor — 65


Pipelined Control

Chapter 4 — The Processor — 66


§4.7 Data Hazards: Forwarding vs. Stalling
Data Hazards in ALU Instructions
 Consider this sequence:
sub $2, $1,$3
and $12,$2,$5
or $13,$6,$2
add $14,$2,$2
sw $15,100($2)
 We can resolve hazards with forwarding
 How do we detect when to forward?

Chapter 4 — The Processor — 67


Dependencies & Forwarding

Chapter 4 — The Processor — 68


Detecting the Need to Forward
 Pass register numbers along pipeline
 e.g., ID/EX.RegisterRs = register number for Rs
sitting in ID/EX pipeline register
 ALU operand register numbers in EX stage
are given by
 ID/EX.RegisterRs, ID/EX.RegisterRt
 Data hazards when
 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs   (fwd from EX/MEM pipeline reg)
 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt   (fwd from EX/MEM pipeline reg)
 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs   (fwd from MEM/WB pipeline reg)
 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt   (fwd from MEM/WB pipeline reg)

Chapter 4 — The Processor — 69


Detecting the Need to Forward
 But only if forwarding instruction will write
to a register!
 EX/MEM.RegWrite, MEM/WB.RegWrite
 And only if Rd for that instruction is not
$zero
 EX/MEM.RegisterRd ≠ 0,
MEM/WB.RegisterRd ≠ 0

Chapter 4 — The Processor — 70


Forwarding Paths

Chapter 4 — The Processor — 71


Forwarding Conditions
 EX hazard
 if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
ForwardA = 10
 if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
ForwardB = 10
 MEM hazard
 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01

Chapter 4 — The Processor — 72
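A C sketch of that forwarding logic as it might appear in a simple pipeline simulator; the struct and field names are invented for illustration, and the EX-hazard check is applied last so the most recent result wins, which also covers the double-hazard case discussed on the next slides (10 selects the EX/MEM value, 01 the MEM/WB value, 00 the register file):

```c
#include <stdint.h>

/* Hypothetical pipeline-register snapshots for one cycle */
struct ExMem { int reg_write; unsigned rd; };
struct MemWb { int reg_write; unsigned rd; };
struct IdEx  { unsigned rs, rt; };

/* Computes the 2-bit ForwardA/ForwardB selects used by the EX-stage muxes */
void forwarding_unit(struct ExMem em, struct MemWb mw, struct IdEx ie,
                     unsigned *forward_a, unsigned *forward_b) {
    *forward_a = 0; *forward_b = 0;                                    /* 00: register file */

    /* MEM hazard (older result) checked first */
    if (mw.reg_write && mw.rd != 0 && mw.rd == ie.rs) *forward_a = 1;  /* 01 */
    if (mw.reg_write && mw.rd != 0 && mw.rd == ie.rt) *forward_b = 1;  /* 01 */

    /* EX hazard (most recent result) checked last, so it takes priority */
    if (em.reg_write && em.rd != 0 && em.rd == ie.rs) *forward_a = 2;  /* 10 */
    if (em.reg_write && em.rd != 0 && em.rd == ie.rt) *forward_b = 2;  /* 10 */
}
```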


Double Data Hazard
 Consider the sequence:
add $1,$1,$2
add $1,$1,$3
add $1,$1,$4
 Both hazards occur
 Want to use the most recent
 Revise MEM hazard condition
 Only fwd if EX hazard condition isn’t true

Chapter 4 — The Processor — 73


Revised Forwarding Condition
 MEM hazard
 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01

Chapter 4 — The Processor — 74


Datapath with Forwarding

Chapter 4 — The Processor — 75


Load-Use Data Hazard

Need to stall
for one cycle

Chapter 4 — The Processor — 76


Load-Use Hazard Detection
 Check when using instruction is decoded
in ID stage
 ALU operand register numbers in ID stage
are given by
 IF/ID.RegisterRs, IF/ID.RegisterRt
 Load-use hazard when
 ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
 If detected, stall and insert bubble

Chapter 4 — The Processor — 77


How to Stall the Pipeline
 Force control values in ID/EX register
to 0
 EX, MEM and WB do nop (no-operation)
 Prevent update of PC and IF/ID register
 Using instruction is decoded again
 Following instruction is fetched again
 1-cycle stall allows MEM to read data for lw
 Can subsequently forward to EX stage

Chapter 4 — The Processor — 78


Stall/Bubble in the Pipeline

Stall inserted
here

Chapter 4 — The Processor — 79


Stall/Bubble in the Pipeline

Or, more
accurately…
Chapter 4 — The Processor — 80
Datapath with Hazard Detection

Chapter 4 — The Processor — 81


Stalls and Performance
The BIG Picture

 Stalls reduce performance


 But are required to get correct results
 Compiler can arrange code to avoid
hazards and stalls
 Requires knowledge of the pipeline structure

Chapter 4 — The Processor — 82


§4.8 Control Hazards
Branch Hazards
 If branch outcome determined in MEM

(Figure: the three instructions fetched after the branch must be flushed, their control values set to 0, and the PC redirected)

Chapter 4 — The Processor — 83


Reducing Branch Delay
 Move hardware to determine outcome to ID
stage
 Target address adder
 Register comparator
 Example: branch taken
36: sub $10, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
...
72: lw $4, 50($7)

Chapter 4 — The Processor — 84


Example: Branch Taken

Chapter 4 — The Processor — 85


Example: Branch Taken

Chapter 4 — The Processor — 86


Data Hazards for Branches
 If a comparison register is a destination of
2nd or 3rd preceding ALU instruction

add $1, $2, $3 IF ID EX MEM WB

add $4, $5, $6 IF ID EX MEM WB

… IF ID EX MEM WB

beq $1, $4, target IF ID EX MEM WB

 Can resolve using forwarding

Chapter 4 — The Processor — 87


Data Hazards for Branches
 If a comparison register is a destination of
preceding ALU instruction or 2nd preceding
load instruction
 Need 1 stall cycle

lw $1, addr IF ID EX MEM WB

add $4, $5, $6 IF ID EX MEM WB

beq stalled IF ID

beq $1, $4, target ID EX MEM WB

Chapter 4 — The Processor — 88


Data Hazards for Branches
 If a comparison register is a destination of
immediately preceding load instruction
 Need 2 stall cycles

lw $1, addr IF ID EX MEM WB

beq stalled IF ID

beq stalled ID

beq $1, $0, target ID EX MEM WB

Chapter 4 — The Processor — 89


Dynamic Branch Prediction
 In deeper and superscalar pipelines, branch
penalty is more significant
 Use dynamic prediction
 Branch prediction buffer (aka branch history table)
 Indexed by recent branch instruction addresses
 Stores outcome (taken/not taken)
 To execute a branch
 Check table, expect the same outcome
 Start fetching from fall-through or target
 If wrong, flush pipeline and flip prediction

Chapter 4 — The Processor — 90


1-Bit Predictor: Shortcoming
 Inner loop branches mispredicted twice!
outer: …

inner: …

beq …, …, inner

beq …, …, outer

 Mispredict as taken on last iteration of


inner loop
 Then mispredict as not taken on first
iteration of inner loop next time around
Chapter 4 — The Processor — 91
2-Bit Predictor
 Only change prediction on two successive
mispredictions

Chapter 4 — The Processor — 92
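A C sketch of a 2-bit saturating-counter predictor of the kind described here; the table size and indexing scheme are illustrative assumptions:

```c
#include <stdint.h>

#define BHT_SIZE 1024                  /* assumed number of branch-history-table entries */
static uint8_t bht[BHT_SIZE];          /* 2-bit counters: 0,1 predict not taken; 2,3 predict taken */

/* Index by low-order word-address bits of the branch PC (illustrative choice) */
static unsigned bht_index(uint32_t pc) { return (pc >> 2) % BHT_SIZE; }

int predict_taken(uint32_t pc) {
    return bht[bht_index(pc)] >= 2;
}

/* Only two successive mispredictions flip the prediction, because the
   counter has to cross between the 0/1 and 2/3 halves of its range. */
void train(uint32_t pc, int taken) {
    uint8_t *c = &bht[bht_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}
```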


Calculating the Branch Target
 Even with predictor, still need to calculate
the target address
 1-cycle penalty for a taken branch
 Branch target buffer
 Cache of target addresses
 Indexed by PC when instruction fetched
 If hit and instruction is branch predicted taken, can
fetch target immediately

Chapter 4 — The Processor — 93


Chapter Five

©2004 Morgan Kaufmann Publishers 1


The Processor: Datapath & Control

• We're ready to look at an implementation of the MIPS


• Simplified to contain only:
– memory-reference instructions: lw, sw
– arithmetic-logical instructions: add, sub, and, or, slt
– control flow instructions: beq, j
• Generic Implementation:
– use the program counter (PC) to supply instruction address
– get the instruction from memory
– read registers
– use the instruction to decide exactly what to do
• All instructions use the ALU after reading the registers
Why? memory-reference? arithmetic? control flow?

©2004 Morgan Kaufmann Publishers 2


Overview of the Implementation

• For every instruction, the first two steps are identical:


1. Send the PC to the memory that contains the code and
fetch the instruction from that memory.
2. Read 1 or 2 registers, using fields of the instruction to
select the registers to read. For load word instruction we
need to read only 1 register, but most other instructions
require that we read 2 registers.
• After these two steps, the actions required to complete the
instruction depend on the instruction class (memory-reference,
arithmetic-logical, and branches).
• All instruction classes use the ALU after reading the registers:
– Memory-reference instructions use ALU for an address
calculation.
– Arithmetic-logical instructions use ALU for operation
execution.
– Branch instructions use ALU for comparison
©2004 Morgan Kaufmann Publishers 3
Overview of the Implementation

• After using ALU, the actions required to complete the


different instruction classes differ:
– A memory-reference instruction will need to
access the memory either to write data for a store
or read data for a load.
– An arithmetic-logical instruction must write the
data from the ALU back into a register.
– A branch instruction may need to change the next
instruction address based on the comparison.

©2004 Morgan Kaufmann Publishers 4


More Implementation Details

• Abstract / Simplified View:

(Figure: abstract view of the implementation: the PC addresses instruction memory, the fetched instruction supplies register numbers to the register file, the ALU operates on register data and computes addresses for data memory, and two adders update the PC)

Two types of functional units:


– elements that operate on data values (combinational)
– elements that contain state (sequential)

©2004 Morgan Kaufmann Publishers 5


Implementation Details of MIPS

• The functional units in MIPS implementation consist of


two different types of logic elements:
1. Elements that operate on data values:
– These elements are all combinational, which means
that their outputs depend only on the current
inputs.
– Given the same input, a combinational element
always produces the same output, because it has
no internal storage.
– The ALU is a combinational element.

©2004 Morgan Kaufmann Publishers 6


Implementation Details of MIPS
2. Elements that contain state:
– An element contains state if it has some internal storage.
– We call these elements state elements because, if we pulled the
plug on the machine, we could restart it by loading the state
elements with the values they contained before we pulled the plug.
– Logic components that contain state called sequential because
their outputs depend on both their inputs and the contents of the
internal state.
– The instruction and data memories as well as the registers are all
examples of state elements.
– A state element has at least two inputs and one output.
– The required inputs are the data value to be written into the
element, and the clock, which determines when the data value is
written.
– The output from a state element provides the value that was written
in an earlier clock cycle.

©2004 Morgan Kaufmann Publishers 7


State Elements

• Unclocked vs. Clocked


• Clocks used in synchronous logic
– when should an element that contains state be updated?

(Figure: clock waveform; the clock period (cycle time) spans one full cycle, with a rising edge and a falling edge)

©2004 Morgan Kaufmann Publishers 8


An unclocked state element

• The set-reset latch


– output depends on present inputs and also on past inputs

[Figure: set-reset (SR) latch with inputs R and S and outputs Q and its complement.]

©2004 Morgan Kaufmann Publishers 9


Latches and Flip-flops

• Output is equal to the stored value inside the element


(don't need to ask for permission to look at the value)
• Change of state (value) is based on the clock
• Latches: whenever the inputs change, and the clock is asserted
• Flip-flop: state changes only on a clock edge
(edge-triggered methodology)

"logically true",
— could mean electrically low

A clocking methodology defines when signals can be read and written


— wouldn't want to read a signal at the same time it was being written

©2004 Morgan Kaufmann Publishers 10


D-latch

• Two inputs:
– the data value to be stored (D)
– the clock signal (C) indicating when to read & store D
• Two outputs:
– the value of the internal state (Q) and its complement

[Figure: D latch. The clock input C gates the data input D into the latch; the outputs are Q and its complement.]

©2004 Morgan Kaufmann Publishers 11


Clocking Methodology

• A clocking methodology defines when signals can be read


and when they can be written.
• If a signal is written at the same time it is read, the value of
the read could correspond to the old value, the newly written
value, or even some mix of the two!
• Assume an edge-triggered clocking methodology, which
means that any values stored in the machine are updated
only on a clock edge.
• The state elements all update their internal storage on the
clock edge.
• Because only state elements can store a data value, any
collection of combinational logic must have its inputs coming
from a set of state elements and its outputs written into a set
of state elements.
• The inputs are values that were written in a previous clock
cycle, while the outputs are values that can be used in the
following clock cycle.

©2004 Morgan Kaufmann Publishers 12


D flip-flop

• Output changes only on the clock edge

[Figure: D flip-flop built from two D latches in series (master and slave), so the output Q changes only on a clock edge.]
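• To make the distinction concrete, here is a small Python sketch (illustrative only, not from the slides) modelling both behaviours:

    class DLatch:
        """Level-sensitive: output follows D whenever the clock C is asserted."""
        def __init__(self):
            self.q = 0
        def tick(self, c, d):
            if c:                         # transparent while the clock is asserted
                self.q = d
            return self.q

    class DFlipFlop:
        """Edge-triggered: state changes only on a rising clock edge."""
        def __init__(self):
            self.q = 0
            self.prev_c = 0
        def tick(self, c, d):
            if c and not self.prev_c:     # rising edge detected
                self.q = d
            self.prev_c = c
            return self.q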

©2004 Morgan Kaufmann Publishers 13


Our Implementation

• An edge triggered methodology


• Typical execution:
– read contents of some state elements,
– send values through some combinational logic
– write results to one or more state elements

[Figure: state element 1, a block of combinational logic, and state element 2, all traversed within one clock cycle.]

©2004 Morgan Kaufmann Publishers 14


Combinational logic, state elements, and clock are
closely related

• The figure in the previous slide shows two state elements
surrounding a block of combinational logic, which operates
in a single clock cycle:
– All signals must propagate from state element 1, through
the combinational logic, and to state element 2 in the time
of one clock cycle.
– The time for signals to reach state element 2 defines the
length of the clock cycle.
• Both the clock signal and the write control signal are inputs,
and the state element is changed only when the write control
signal is asserted and a clock edge occurs.

©2004 Morgan Kaufmann Publishers 15


Instruction Memory, Program Counter, and Adder

• First element we need is a place to store the


instructions of a program (a memory unit).

• To execute any instruction, we must start by fetching


the instruction from memory.

• To prepare for executing the next instruction, we


must increment the program counter (PC), so that it
points at the next instruction, 4 bytes later.

• Therefore, two state elements are needed to store
and access instructions, and an adder is needed to
compute the next instruction address.

©2004 Morgan Kaufmann Publishers 16


A portion of the datapath used for fetching
instructions and incrementing the PC

• The PC is a 32-bit register that will be written at the end of every


clock cycle.
• The adder is an ALU wired to always perform an add of its two
32-bit inputs and place the result on its output.

[Figure: instruction fetch. The PC supplies the read address to the instruction memory, which outputs the instruction; an adder computes PC + 4.]

©2004 Morgan Kaufmann Publishers 17


Register File

• R-format instructions (add, sub, slt, and, or) all


read two registers, perform an ALU operation on
the contents of the registers, and write the result.
• The processor’s 32 registers are stored in a
structure called a register file.
• A register file is a collection of registers in which
any register can be read or written by specifying
the number of the register in the file.
• Since R-format instructions have 3 register operands, we
will need to read two data words from the register
file and write one data word into the register file for
each instruction.

©2004 Morgan Kaufmann Publishers 18


Register File

• For each data word to be read from the registers:


– We need an input to the register file that
specifies the register number to be read
– We need an output from the register file that will
carry the value that has been read from the
registers.
• To write a data word, we need two inputs:
– One to specify the register number to be written
– One to supply the data to be written into the
register.
• We need a total of 4 inputs (3 for register numbers
and 1 for data) and 2 outputs (both for data) as
shown in the next slide.
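• A behavioural sketch of such a register file in Python (illustrative only; the class and method names are ours, not the book's):

    class RegisterFile:
        def __init__(self, n=32):
            self.regs = [0] * n
        def read(self, rnum1, rnum2):
            # Two read ports: register numbers in, data values out.
            return self.regs[rnum1], self.regs[rnum2]
        def write(self, wnum, wdata, reg_write):
            # One write port, used only when RegWrite is asserted
            # (in hardware, on the clock edge); register 0 stays 0 in MIPS.
            if reg_write and wnum != 0:
                self.regs[wnum] = wdata & 0xFFFFFFFF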
©2004 Morgan Kaufmann Publishers 19
Register File

• Built using D flip-flops

[Figure: register file read ports. Each read register number drives a multiplexor that selects the corresponding register's contents onto Read data 1 or Read data 2.]

Do you understand? What is the “Mux” above?

©2004 Morgan Kaufmann Publishers 20


Register File

• Note: we still use the real clock to determine when to write

[Figure: register file write port. The write register number drives an n-to-2^n decoder; the selected decoder output, combined with the Write signal, clocks the register data (D input) into the chosen register.]

©2004 Morgan Kaufmann Publishers 21


Register File and ALU
• The register number inputs are 5 bits wide to
specify one of the 32 registers (32 = 2^5),
whereas the data input and the two output
buses are each 32 bits wide.

• ALU is controlled by 4-bit signal and it takes


two 32-bit inputs and produces a 32-bit result
as shown in the next slide.

©2004 Morgan Kaufmann Publishers 22


Register File and ALU

[Figure: (a) the register file, with two 5-bit read register numbers, a 5-bit write register number, write data, two 32-bit read data outputs, and the RegWrite control; (b) the ALU, with two 32-bit data inputs, a 4-bit ALU control input, the 32-bit ALU result, and a Zero output.]

©2004 Morgan Kaufmann Publishers 23


The datapath for R-type instructions

• The two elements needed to implement R-format ALU operations are
the register file and the ALU
[Figure: R-type datapath. Instruction fields supply the two read register numbers and the write register number; the two read data outputs feed the ALU (4-bit ALU control, Zero output), and the ALU result is written back to the register file when RegWrite is asserted.]

©2004 Morgan Kaufmann Publishers 24


Simple Implementation

• Include the functional units we need for each instruction

[Figure: the individual functional units: (a) instruction memory, (b) program counter, (c) adder; (a) data memory unit with MemRead and MemWrite, (b) sign-extension unit (16 to 32 bits); (a) registers with RegWrite, (b) ALU with its 4-bit operation input and Zero output.]

Why do we need this stuff?

©2004 Morgan Kaufmann Publishers 25


Building the Datapath

• Use multiplexors to stitch them together

[Figure: the datapath assembled: PC, instruction memory, register file, sign-extension unit, ALU, branch adder with shift-left-2, and data memory, stitched together with multiplexors controlled by PCSrc, ALUSrc, MemtoReg, RegWrite, MemWrite, and MemRead.]

©2004 Morgan Kaufmann Publishers 26


MIPS load word and store word instructions

• Load word instruction: lw $t1, offset_value($t2)


• Store word instruction: sw $t1, offset_value($t2)

• These instructions compute a memory address by
adding the base register ($t2) to the 16-bit
signed offset field contained in the instruction.
• If the instruction is a store, the value to be stored
must be read from the register file where it resides
in $t1.
• If the instruction is a load, the value read from
memory must be written into the register file in the
specified register, which is $t1.

©2004 Morgan Kaufmann Publishers 27


To implement a datapath for MIPS load and store instructions

• Therefore we will need the following:


– Both the register file and ALU.
– A unit to sign-extend the 16-bit offset field in the
instruction to a 32-bit signed value.
– A data memory unit to read from or write to.

• Because the data memory must be written on store
instructions and read on loads, it has both read and write
control signals, an address input, and an input for the data
to be written into memory (see the next slide).
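• The sign-extension step itself is straightforward; a Python illustration (the function name is ours, not the book's):

    def sign_extend_16_to_32(offset16):
        """Extend a 16-bit two's-complement value to 32 bits."""
        offset16 &= 0xFFFF                # keep only the low 16 bits
        if offset16 & 0x8000:             # sign bit set, so the value is negative
            return offset16 | 0xFFFF0000  # replicate the sign bit upward
        return offset16

    # e.g. sign_extend_16_to_32(0xFFFC) == 0xFFFFFFFC  (-4 as a 32-bit pattern)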

©2004 Morgan Kaufmann Publishers 28


Data Memory and Sign-extension Units
[Figure: (a) the data memory unit, with Address, Write data, Read data, MemRead, and MemWrite; (b) the sign-extension unit, which extends a 16-bit input to 32 bits.]

©2004 Morgan Kaufmann Publishers 29


Datapath for MIPS load and store instructions

• Assume the instruction has already been fetched.

• The register number inputs for the register file


come from fields of the instruction, as does the
offset value, which after sign extension becomes
the second ALU input.

• The figure in the next slide shows the datapath: for
a load or a store, a register access is followed
by a memory address calculation, then a read or
write from memory, and a write into the register file
if the instruction is a load.

©2004 Morgan Kaufmann Publishers 30


Datapath for MIPS load and store instructions
[Figure: load/store datapath. Register numbers come from instruction fields; the sign-extended 16-bit offset is the second ALU input; the ALU result addresses the data memory; MemRead, MemWrite, and RegWrite control the accesses.]

©2004 Morgan Kaufmann Publishers 31


Branch if equal (beq) instruction

• The beq instruction has 3 operands: 2 registers that are
compared for equality, and a 16-bit offset used to
compute the branch target address relative to the
branch instruction address.
• Form: beq $t1, $t2, offset
• To implement this instruction, we must compute
the branch target address by adding the sign-
extended offset field of the instruction to the PC.

©2004 Morgan Kaufmann Publishers 32


beq Instruction

• There are two details in the definition of the branch


instruction (for details see chapter 3):

1. The instruction set architecture specifies that the base


for the branch address calculation is the address of the
instruction following the branch.
Since we compute PC+4 (address of next instruction), it
is easy to use this value as the base for computing the
branch target address.

2. The architecture states that the offset field is shifted
left 2 bits so that it is a word offset; this shift increases
the effective range of the offset field by a factor of four.
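• Putting the two details together, the branch target calculation can be sketched in Python (illustrative only; it reuses the sign_extend_16_to_32 helper sketched a few slides back):

    def branch_target(pc, offset16):
        # Base is PC + 4 (the address of the instruction after the branch);
        # the sign-extended 16-bit offset is shifted left 2 (a word offset).
        return ((pc + 4) + (sign_extend_16_to_32(offset16) << 2)) & 0xFFFFFFFF

    # e.g. with pc = 0x1000 and offset16 = 3: target = 0x1004 + 12 = 0x1010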

©2004 Morgan Kaufmann Publishers 33


beq Instruction (Cont.)

• If the two operands are equal (condition is true)


then the branch target address becomes the new
PC, and we say the branch is taken.
• If the two operands are not equal (condition is not
true) then the incremented PC should replace the
current PC, and we say the branch is not taken.
• The branch datapath must do two operations:
compute the branch target address and compare
the register contents.

©2004 Morgan Kaufmann Publishers 34


A Datapath for MIPS beq Instruction

[Figure: beq datapath. PC + 4 from the instruction datapath is added to the sign-extended offset, shifted left 2, to form the branch target; the register file supplies the two operands to the ALU, whose Zero output goes to the branch control logic.]

©2004 Morgan Kaufmann Publishers 35


A Datapath for MIPS beq Instruction
• To compute the branch target address, the branch
datapath includes a sign extension unit and an adder.
• To perform the compare, we need to use the register file to
supply the two register operands.
• Also, because the ALU provides an output signal that indicates
whether the result was 0, we can send the two register
operands to the ALU with the control set to do a subtract.
– If the Zero signal out of the ALU unit is asserted, then
the two values are equal.
• Since the offset was sign-extended from 16 bits, the shift
will throw away only “sign bits”.
• Control logic is used to decide whether the incremented
PC or branch target should replace the PC, based on the
Zero output of the ALU.

©2004 Morgan Kaufmann Publishers 36


Combining datapaths for memory and R-type instructions
using multiplexors
[Figure: the load/store and R-type datapaths combined with multiplexors. ALUSrc selects the second ALU input (Read data 2 for R-type, the sign-extended offset for load/store); MemtoReg selects the value written back to the register file (the ALU result for R-type, the memory read data for a load).]
©2004 Morgan Kaufmann Publishers 37
Combining datapaths for memory instructions, R-type
instructions, and instruction fetch

[Figure: the combined datapath with instruction fetch added. The PC addresses the instruction memory and an adder computes PC + 4, while the instruction fields drive the register file, ALU, and data memory as on the previous slide.]
©2004 Morgan Kaufmann Publishers 38
Combining datapaths for memory instructions, R-type
instructions, and instruction fetch

• The combined datapath in the previous slide
requires both an adder and an ALU, since the adder
is used to increment the PC while the ALU is
used for executing the instruction in the same
clock cycle.

• In the next slide we build a simple datapath for the


MIPS architecture, which can execute the basic
instructions (load / store word, ALU operations,
and branches) in a single clock cycle.

©2004 Morgan Kaufmann Publishers 39


Building the Datapath for MIPS Architecture

• Use multiplexors to stitch them together


[Figure: the simple single-cycle datapath for the MIPS architecture: instruction fetch (PC, instruction memory, PC + 4 adder), register file and ALU, data memory, sign-extension and shift-left-2 units, the branch-target adder, and multiplexors controlled by PCSrc, ALUSrc, MemtoReg, RegWrite, MemWrite, and MemRead.]
©2004 Morgan Kaufmann Publishers 40


Building the Datapath for MIPS Architecture

• From the previous slide, the branch instruction


uses the main ALU for comparison of the register
operands, so the adder is used for computing the
branch target address.

• Also, an additional multiplexor is required to select


either the sequentially following instruction
address (PC + 4) or the branch target address to be
written into the PC.

©2004 Morgan Kaufmann Publishers 41


Control

• Selecting the operations to perform (ALU, read/write, etc.)


• Controlling the flow of data (multiplexor inputs)
• Information comes from the 32 bits of the instruction
• Example:

add $8, $17, $18        Instruction Format:

    000000   10001   10010   01000   00000   100000
      op       rs      rt      rd    shamt    funct

• ALU's operation based on instruction type and function code
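• For illustration, the fields of an R-format word can be pulled out with shifts and masks in Python (the function name is ours); the 32-bit value in the comment is the add $8, $17, $18 encoding shown above:

    def decode_r_format(word):
        """Split a 32-bit MIPS R-format instruction into its six fields."""
        return {
            "op":    (word >> 26) & 0x3F,
            "rs":    (word >> 21) & 0x1F,
            "rt":    (word >> 16) & 0x1F,
            "rd":    (word >> 11) & 0x1F,
            "shamt": (word >>  6) & 0x1F,
            "funct":  word        & 0x3F,
        }

    # decode_r_format(0x02324020) ->
    # {'op': 0, 'rs': 17, 'rt': 18, 'rd': 8, 'shamt': 0, 'funct': 32}   # funct 32 = 0x20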

©2004 Morgan Kaufmann Publishers 42


Control

• e.g., what should the ALU do with this instruction


• Example: lw $1, 100($2)

     35       2       1         100
     op       rs      rt    16-bit offset

• ALU control input

0000 AND
0001 OR
0010 add
0110 subtract
0111 set-on-less-than
1100 NOR

• Why is the code for subtract 0110 and not 0011?
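• A software analogue of the ALU for the control codes listed above (Python, illustrative only):

    MASK32 = 0xFFFFFFFF

    def to_signed(x):
        x &= MASK32
        return x - (1 << 32) if x & 0x80000000 else x

    def alu(control, a, b):
        """Behavioural sketch of the ALU; returns (result, Zero)."""
        if   control == 0b0000: result = a & b                              # AND
        elif control == 0b0001: result = a | b                              # OR
        elif control == 0b0010: result = (a + b) & MASK32                   # add
        elif control == 0b0110: result = (a - b) & MASK32                   # subtract
        elif control == 0b0111: result = int(to_signed(a) < to_signed(b))   # set-on-less-than
        elif control == 0b1100: result = ~(a | b) & MASK32                  # NOR
        else: raise ValueError("unused ALU control code")
        return result, result == 0            # the Zero output used by beq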


©2004 Morgan Kaufmann Publishers 43
Control

• Must describe hardware to compute the 4-bit ALU control input
  – given the instruction type (the 2-bit ALUOp, computed from the instruction type):
        00 = lw, sw
        01 = beq
        10 = arithmetic
  – and the function code, for arithmetic instructions

• Describe it using a truth table (can turn into gates):

©2004 Morgan Kaufmann Publishers 44


[Figure: the simple datapath with the control unit added. The opcode field (Instruction [31–26]) drives the Control unit, which generates RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; the function field (Instruction [5–0]) and ALUOp feed the ALU control; instruction fields [25–21], [20–16], [15–11], and [15–0] supply the register numbers and the offset.]

Instruction   RegDst   ALUSrc   MemtoReg   RegWrite   MemRead   MemWrite   Branch   ALUOp1   ALUOp0
R-format         1        0         0          1          0         0         0        1        0
lw               0        1         1          1          1         0         0        0        0
sw               X        1         X          0          0         1         0        0        0
beq              X        0         X          0          0         0         1        0        1
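• In software form this table is just a lookup from instruction class to control values (an illustrative Python dictionary; None stands for a don't-care X):

    MAIN_CONTROL = {
        "R-format": dict(RegDst=1,    ALUSrc=0, MemtoReg=0,    RegWrite=1,
                         MemRead=0,   MemWrite=0, Branch=0,    ALUOp=0b10),
        "lw":       dict(RegDst=0,    ALUSrc=1, MemtoReg=1,    RegWrite=1,
                         MemRead=1,   MemWrite=0, Branch=0,    ALUOp=0b00),
        "sw":       dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                         MemRead=0,   MemWrite=1, Branch=0,    ALUOp=0b00),
        "beq":      dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                         MemRead=0,   MemWrite=0, Branch=1,    ALUOp=0b01),
    }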
Control

• Simple combinational logic (truth tables)

[Figure: the control function as a truth table: inputs Op5–Op0 (the opcode bits), one column per instruction class (R-format, lw, sw, beq), and one row per output signal (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0).]

©2004 Morgan Kaufmann Publishers 46


Single Cycle Implementation

• Calculate cycle time assuming negligible delays except:


– memory (200ps)
– ALU and adders (100ps)
– register file access (50ps)
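• A back-of-the-envelope version of the calculation (Python, illustrative; it uses only the delays above and treats everything else as negligible):

    MEM, ALU, REGFILE = 200, 100, 50      # assumed delays in ps

    paths = {
        # fetch + register read + ALU + (data memory) + (register write)
        "R-format": MEM + REGFILE + ALU + 0   + REGFILE,   # 400 ps
        "lw":       MEM + REGFILE + ALU + MEM + REGFILE,   # 600 ps
        "sw":       MEM + REGFILE + ALU + MEM,             # 550 ps
        "beq":      MEM + REGFILE + ALU,                   # 350 ps
    }

    # A single-cycle clock must fit the slowest instruction:
    cycle_time = max(paths.values())      # 600 ps, set by lw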
[Figure: the single-cycle datapath again, used to trace each instruction class's path through instruction memory, register file, ALU, and data memory when calculating the cycle time.]

©2004 Morgan Kaufmann Publishers 47


Where we are headed

• Single Cycle Problems:


– what if we had a more complicated instruction like floating
point?
– wasteful of area
• One Solution:
– use a “smaller” cycle time
– have different instructions take different numbers of cycles
– a “multicycle” datapath:

[Figure: high-level view of the multicycle datapath: a single memory (for instructions or data), the Instruction register and Memory data register, the register file, the A and B registers, and a single ALU with its ALUOut register.]

©2004 Morgan Kaufmann Publishers 48


Multicycle Approach

• We will be reusing functional units


– ALU used to compute address and to increment PC
– Memory used for instruction and data
• Our control signals will not be determined directly by the instruction
– e.g., what should the ALU do for a “subtract” instruction?
• We’ll use a finite state machine for control

©2004 Morgan Kaufmann Publishers 49


Multicycle Approach

• Break up the instructions into steps, each step takes a cycle


– balance the amount of work to be done
– restrict each cycle to use only one major functional unit
• At the end of a cycle
– store values for use in later cycles (easiest thing to do)
– introduce additional “internal” registers

[Figure: the multicycle datapath. Multiplexors select the memory address (PC or ALUOut), the register write data, and the ALU inputs; the Instruction register, Memory data register, A, B, and ALUOut registers hold values between cycles; sign-extend and shift-left-2 units prepare the offset.]

©2004 Morgan Kaufmann Publishers 50


Instructions from ISA perspective

• Consider each instruction from perspective of ISA.


• Example:
– The add instruction changes a register.
– Register specified by bits 15:11 of instruction.
– Instruction specified by the PC.
– New value is the sum (“op”) of two registers.
– Registers specified by bits 25:21 and 20:16 of the instruction

Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]

– In order to accomplish this we must break up the instruction.


(kind of like introducing variables when programming)

©2004 Morgan Kaufmann Publishers 51


Breaking down an instruction

• ISA definition of arithmetic:

Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]

• Could break down to:


– IR <= Memory[PC]
– A <= Reg[IR[25:21]]
– B <= Reg[IR[20:16]]
– ALUOut <= A op B
– Reg[IR[15:11]] <= ALUOut

• We forgot an important part of the definition of arithmetic!


– PC <= PC + 4
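• A software rendering of this breakdown, with the PC update included (Python, illustrative only; IR, A, B, and ALUOut model the internal registers, and alu is any function mapping (funct, A, B) to a result):

    def multicycle_r_type(state, memory, regs, alu):
        # Step 1: instruction fetch
        state["IR"] = memory[state["PC"]]
        state["PC"] += 4                          # the part we must not forget
        ir = state["IR"]
        # Step 2: decode / register fetch
        state["A"] = regs[(ir >> 21) & 0x1F]      # rs = IR[25:21]
        state["B"] = regs[(ir >> 16) & 0x1F]      # rt = IR[20:16]
        # Step 3: execute
        state["ALUOut"] = alu(ir & 0x3F, state["A"], state["B"])
        # Step 4: R-type completion, write back to rd = IR[15:11]
        regs[(ir >> 11) & 0x1F] = state["ALUOut"]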

©2004 Morgan Kaufmann Publishers 52


Idea behind multicycle approach

• We define each instruction from the ISA perspective (do this!)

• Break it down into steps following our rule that data flows through at
most one major functional unit (e.g., balance work across steps)

• Introduce new registers as needed (e.g., A, B, ALUOut, MDR, etc.)

• Finally, try to pack as much work as possible into each step
(avoid unnecessary cycles)
while also trying to share steps where possible
(minimizes control, helps to simplify the solution)

• Result: Our book’s multicycle Implementation!

©2004 Morgan Kaufmann Publishers 53


Five Execution Steps

• Instruction Fetch

• Instruction Decode and Register Fetch

• Execution, Memory Address Computation, or Branch Completion

• Memory Access or R-type instruction completion

• Write-back step

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

©2004 Morgan Kaufmann Publishers 54


Step 1: Instruction Fetch

• Use PC to get instruction and put it in the Instruction Register.


• Increment the PC by 4 and put the result back in the PC.
• Can be described succinctly using RTL "Register-Transfer Language"

IR <= Memory[PC];
PC <= PC + 4;

Can we figure out the values of the control signals?

What is the advantage of updating the PC now?

©2004 Morgan Kaufmann Publishers 55


Step 2: Instruction Decode and Register Fetch

• Read registers rs and rt in case we need them


• Compute the branch address in case the instruction is a branch
• RTL:

A <= Reg[IR[25:21]];
B <= Reg[IR[20:16]];
ALUOut <= PC + (sign-extend(IR[15:0]) << 2);

• We aren't setting any control lines based on the instruction type


(we are busy "decoding" it in our control logic)

©2004 Morgan Kaufmann Publishers 56


Step 3 (instruction dependent)

• ALU is performing one of three functions, based on instruction type

• Memory Reference:

ALUOut <= A + sign-extend(IR[15:0]);

• R-type:

ALUOut <= A op B;

• Branch:

if (A==B) PC <= ALUOut;

©2004 Morgan Kaufmann Publishers 57


Step 4 (R-type or memory-access)

• Loads and stores access memory

MDR <= Memory[ALUOut];


or
Memory[ALUOut] <= B;

• R-type instructions finish

Reg[IR[15:11]] <= ALUOut;

The write actually takes place at the end of the cycle on the edge

©2004 Morgan Kaufmann Publishers 58


Write-back step

• Reg[IR[20:16]] <= MDR;

Which instruction needs this?

©2004 Morgan Kaufmann Publishers 59


Summary:

©2004 Morgan Kaufmann Publishers 60


Simple Questions

• How many cycles will it take to execute this code?

lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label #assume not
add $t5, $t2, $t3
sw $t5, 8($t3)
Label: ...

• What is going on during the 8th cycle of execution?


• In what cycle does the actual addition of $t2 and $t3 take place?
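• To sanity-check the first question, a tiny Python sketch using the 3 to 5 cycle counts from the multicycle approach (lw = 5, sw = 4, R-type = 4, beq = 3); the total is offered as an illustration, not as the official solution:

    CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3}

    program = ["lw", "lw", "beq", "add", "sw"]    # the sequence above, branch not taken
    total = sum(CYCLES[i] for i in program)       # 5 + 5 + 3 + 4 + 4 = 21 cycles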

©2004 Morgan Kaufmann Publishers 61
