
COMPUTER ORGANIZATION AND DESIGN, 5th Edition
The Hardware/Software Interface

Chapter 1
Computer Abstractions
and Technology
Classes of Computers
 Personal computers
 General purpose, variety of software
 Subject to cost/performance tradeoff

 Server computers
 Network based
 High capacity, performance, reliability
 Range from small servers to building sized

Chapter 1 — Computer Abstractions and Technology — 2


Classes of Computers
 Supercomputers
 High-end scientific and engineering
calculations
 Highest capability but represent a small
fraction of the overall computer market

 Embedded computers
 Hidden as components of systems
 Stringent power/performance/cost constraints

Chapter 1 — Computer Abstractions and Technology — 3


What You Will Learn
 How programs are translated into the
machine language
 And how the hardware executes them
 The hardware/software interface
 What determines program performance
 And how it can be improved
 How hardware designers improve
performance
 What is parallel processing
Chapter 1 — Computer Abstractions and Technology — 4
Understanding Performance
 Algorithm
 Determines number of operations executed
 Programming language, compiler, architecture
 Determine number of machine instructions executed
per operation
 Processor and memory system
 Determine how fast instructions are executed
 I/O system (including OS)
 Determines how fast I/O operations are executed

Chapter 1 — Computer Abstractions and Technology — 5


§1.2 Eight Great Ideas in Computer Architecture
Eight Great Ideas
 Design for Moore’s Law

 Use abstraction to simplify design

 Make the common case fast

 Performance via parallelism

 Performance via pipelining

 Performance via prediction

 Hierarchy of memories

 Dependability via redundancy

Chapter 1 — Computer Abstractions and Technology — 6


§1.3 Below Your Program
Below Your Program
 Application software
 Written in high-level language
 System software
 Compiler: translates HLL code to
machine code
 Operating System: service code
 Handling input/output
 Managing memory and storage
 Scheduling tasks & sharing resources
 Hardware
 Processor, memory, I/O controllers

Chapter 1 — Computer Abstractions and Technology — 7


Levels of Program Code
 High-level language
 Level of abstraction closer
to problem domain
 Provides for productivity
and portability
 Assembly language
 Textual representation of
instructions
 Hardware representation
 Binary digits (bits)
 Encoded instructions and
data

Chapter 1 — Computer Abstractions and Technology — 8


§1.4 Under the Covers
Components of a Computer
The BIG Picture  Same components for
all kinds of computer
 Desktop, server,
embedded
 Input/output includes
 User-interface devices
 Display, keyboard, mouse
 Storage devices
 Hard disk, CD/DVD, flash
 Network adapters
 For communicating with
other computers

Chapter 1 — Computer Abstractions and Technology — 9


Inside the Processor (CPU)
 Datapath: performs operations on data
 Control: sequences datapath, memory, ...
 Cache memory
 Small fast SRAM memory for immediate
access to data

Chapter 1 — Computer Abstractions and Technology — 10


Inside the Processor
 Apple A5 SoC

Chapter 1 — Computer Abstractions and Technology — 11


A Safe Place for Data
 Volatile main memory
 Loses instructions and data when power off
 Non-volatile secondary memory
 Magnetic disk
 Flash memory
 Optical disk (CDROM, DVD)

Chapter 1 — Computer Abstractions and Technology — 12


§1.5 Technologies for Building Processors and Memory
Technology Trends
 Electronics
technology
continues to evolve
 Increased capacity
and performance
 Reduced cost
(Figure: DRAM capacity growth over time)

Year Technology Relative performance/cost


1951 Vacuum tube 1
1965 Transistor 35
1975 Integrated circuit (IC) 900
1995 Very large scale IC (VLSI) 2,400,000
2013 Ultra large scale IC 250,000,000,000

Chapter 1 — Computer Abstractions and Technology — 13


Manufacturing ICs

 Yield: proportion of working dies per wafer

Chapter 1 — Computer Abstractions and Technology — 14


Intel Core i7 Wafer

 300mm wafer, 280 chips, 32nm technology


 Each chip is 20.7 x 10.5 mm
Chapter 1 — Computer Abstractions and Technology — 15
Integrated Circuit Cost

 Cost per die = Cost per wafer / (Dies per wafer × Yield)
 Dies per wafer ≈ Wafer area / Die area
 Yield = 1 / (1 + (Defects per area × Die area / 2))²

 Nonlinear relation to area and defect rate
 Wafer cost and area are fixed
 Defect rate determined by manufacturing process
 Die area determined by architecture and circuit design

Chapter 1 — Computer Abstractions and Technology — 16
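A minimal C sketch of these cost relations; the formulas follow the slide, while the wafer cost, die area, and defect rate are assumed values chosen only for illustration:

```c
#include <math.h>
#include <stdio.h>

/* Cost model from the slide:
   Dies per wafer ~ Wafer area / Die area
   Yield = 1 / (1 + Defects per area * Die area / 2)^2
   Cost per die = Cost per wafer / (Dies per wafer * Yield) */
int main(void) {
    double wafer_cost  = 5000.0;                 /* assumed cost per wafer ($) */
    double wafer_area  = 3.14159 * 15.0 * 15.0;  /* 300 mm wafer -> 15 cm radius, area in cm^2 */
    double die_area    = 2.17;                   /* assumed die area in cm^2 (about 20.7 x 10.5 mm) */
    double defect_rate = 0.03;                   /* assumed defects per cm^2 */

    double dies_per_wafer = wafer_area / die_area;
    double yield = 1.0 / pow(1.0 + defect_rate * die_area / 2.0, 2.0);
    double cost_per_die = wafer_cost / (dies_per_wafer * yield);

    printf("dies/wafer = %.0f, yield = %.2f, cost/die = $%.2f\n",
           dies_per_wafer, yield, cost_per_die);
    return 0;
}
```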


§1.6 Performance
Defining Performance
 Which airplane has the best performance?

(Charts compare the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 on four metrics: passenger capacity, cruising range (miles), cruising speed (mph), and passengers × mph.)

Chapter 1 — Computer Abstractions and Technology — 17


Response Time and Throughput
 Response time
 How long it takes to do a task
 Throughput
 Total work done per unit time
 e.g., tasks/transactions/… per hour
 How are response time and throughput affected
by
 Replacing the processor with a faster version?
 Adding more processors?
 We’ll focus on response time for now…

Chapter 1 — Computer Abstractions and Technology — 18


Relative Performance
 Define Performance = 1/Execution Time
 “X is n times faster than Y”
 Performance_X / Performance_Y
 = Execution time_Y / Execution time_X = n

 Example: time taken to run a program


 10s on A, 15s on B
 Execution TimeB / Execution TimeA
= 15s / 10s = 1.5
 So A is 1.5 times faster than B
Chapter 1 — Computer Abstractions and Technology — 19
Measuring Execution Time
 Elapsed time
 Total response time, including all aspects
 Processing, I/O, OS overhead, idle time
 Determines system performance
 CPU time
 Time spent processing a given job
 Discounts I/O time, other jobs’ shares
 Comprises user CPU time and system CPU
time
 Different programs are affected differently by
CPU and system performance
Chapter 1 — Computer Abstractions and Technology — 20
CPU Clocking
 Operation of digital hardware governed by a
constant-rate clock
(Figure: clock signal showing the clock period; data transfer and computation happen within a cycle, and state updates on the clock edge)
 Clock period: duration of a clock cycle


 e.g., 250 ps = 0.25 ns = 250×10^–12 s
 Clock frequency (rate): cycles per second
 e.g., 4.0 GHz = 4000 MHz = 4.0×10^9 Hz
Chapter 1 — Computer Abstractions and Technology — 21
CPU Time
CPU Time = CPU Clock Cycles × Clock Cycle Time
         = CPU Clock Cycles / Clock Rate
 Performance improved by
 Reducing number of clock cycles
 Increasing clock rate
 Hardware designer must often trade off clock
rate against cycle count

Chapter 1 — Computer Abstractions and Technology — 22


CPU Time Example
 Computer A: 2GHz clock, 10s CPU time
 Designing Computer B
 Aim for 6s CPU time
 Can do faster clock, but causes 1.2 × clock cycles
 How fast must Computer B clock be?
Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20×10^9

Clock Rate_B = (1.2 × 20×10^9) / 6s = (24×10^9) / 6s = 4GHz
Chapter 1 — Computer Abstractions and Technology — 23
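A quick C check of the same calculation (a sketch; the numbers come straight from the slide):

```c
#include <stdio.h>

int main(void) {
    double clock_rate_a = 2e9;     /* Computer A: 2 GHz */
    double cpu_time_a   = 10.0;    /* seconds */
    double cpu_time_b   = 6.0;     /* target CPU time for Computer B */

    double cycles_a = cpu_time_a * clock_rate_a;   /* 20e9 cycles */
    double cycles_b = 1.2 * cycles_a;              /* the faster clock costs 1.2x cycles */
    double clock_rate_b = cycles_b / cpu_time_b;   /* required clock rate */

    printf("Computer B needs %.1f GHz\n", clock_rate_b / 1e9);  /* prints 4.0 GHz */
    return 0;
}
```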
Instruction Count and CPI
Clock Cycles = Instruction Count × Cycles per Instruction
CPU Time = Instruction Count × CPI × Clock Cycle Time
         = Instruction Count × CPI / Clock Rate
 Instruction Count for a program
 Determined by program, ISA and compiler
 Average cycles per instruction
 Determined by CPU hardware
 If different instructions have different CPI
 Average CPI affected by instruction mix

Chapter 1 — Computer Abstractions and Technology — 24


CPI Example
 Computer A: Cycle Time = 250ps, CPI = 2.0
 Computer B: Cycle Time = 500ps, CPI = 1.2
 Same ISA
 Which is faster, and by how much?
CPU Time_A = Instruction Count × CPI_A × Cycle Time_A
           = I × 2.0 × 250ps = I × 500ps        (A is faster…)
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B
           = I × 1.2 × 500ps = I × 600ps

CPU Time_B / CPU Time_A = (I × 600ps) / (I × 500ps) = 1.2   (…by this much)
Chapter 1 — Computer Abstractions and Technology — 25
CPI in More Detail
 If different instruction classes take different
numbers of cycles
Clock Cycles = Σ (i = 1 to n) (CPI_i × Instruction Count_i)

 Weighted average CPI

CPI = Clock Cycles / Instruction Count
    = Σ (i = 1 to n) ( CPI_i × (Instruction Count_i / Instruction Count) )

where (Instruction Count_i / Instruction Count) is the relative frequency

Chapter 1 — Computer Abstractions and Technology — 26


CPI Example
 Alternative compiled code sequences using
instructions in classes A, B, C

Class A B C
CPI for class 1 2 3
IC in sequence 1 2 1 2
IC in sequence 2 4 1 1

 Sequence 1: IC = 5
   Clock Cycles = 2×1 + 1×2 + 2×3 = 10
   Avg. CPI = 10/5 = 2.0
 Sequence 2: IC = 6
   Clock Cycles = 4×1 + 1×2 + 1×3 = 9
   Avg. CPI = 9/6 = 1.5
Chapter 1 — Computer Abstractions and Technology — 27
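The same arithmetic as a small C sketch (values taken from the table above):

```c
#include <stdio.h>

int main(void) {
    int cpi[3]  = {1, 2, 3};      /* CPI for classes A, B, C */
    int seq1[3] = {2, 1, 2};      /* instruction counts, sequence 1 */
    int seq2[3] = {4, 1, 1};      /* instruction counts, sequence 2 */
    int *seqs[2] = {seq1, seq2};

    for (int s = 0; s < 2; s++) {
        int cycles = 0, ic = 0;
        for (int c = 0; c < 3; c++) {
            cycles += cpi[c] * seqs[s][c];   /* Clock Cycles = sum(CPI_i * IC_i) */
            ic     += seqs[s][c];
        }
        printf("Sequence %d: IC = %d, cycles = %d, avg CPI = %.1f\n",
               s + 1, ic, cycles, (double)cycles / ic);
    }
    return 0;
}
```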
Performance Summary
The BIG Picture

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

 Performance depends on
 Algorithm: affects IC, possibly CPI
 Programming language: affects IC, CPI
 Compiler: affects IC, CPI
 Instruction set architecture: affects IC, CPI, Tc

Chapter 1 — Computer Abstractions and Technology — 28


Multiprocessors
 Multicore microprocessors
 More than one processor per chip
 Requires explicitly parallel programming
 Compare with instruction level parallelism
 Hardware executes multiple instructions at once
 Hidden from the programmer
 Hard to do
 Programming for performance
 Load balancing
 Optimizing communication and synchronization

Chapter 1 — Computer Abstractions and Technology — 29


SPEC CPU Benchmark
 Programs used to measure performance
 Supposedly typical of actual workload
 Standard Performance Evaluation Corp (SPEC)
 Develops benchmarks for CPU, I/O, Web, …
 SPEC CPU2006
 Elapsed time to execute a selection of programs
 Negligible I/O, so focuses on CPU performance
 Normalize relative to reference machine
 Summarize as geometric mean of performance ratios
 CINT2006 (integer) and CFP2006 (floating-point)

Geometric mean = n-th root of ( Π (i = 1 to n) Execution time ratio_i )

Chapter 1 — Computer Abstractions and Technology — 30
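A sketch of the geometric-mean summary in C; the ratio values below are invented for illustration, since SPEC reports per-benchmark execution-time ratios against the reference machine:

```c
#include <math.h>
#include <stdio.h>

/* Geometric mean = n-th root of the product of the n performance ratios */
double geometric_mean(const double *ratios, int n) {
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(ratios[i]);   /* sum of logs avoids overflow of the product */
    return exp(log_sum / n);
}

int main(void) {
    double ratios[] = {12.3, 9.8, 15.1, 7.4};   /* hypothetical per-benchmark ratios */
    int n = sizeof(ratios) / sizeof(ratios[0]);
    printf("SPECratio (geometric mean) = %.2f\n", geometric_mean(ratios, n));
    return 0;
}
```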


§1.10 Fallacies and Pitfalls
Pitfall: Amdahl’s Law
 Improving an aspect of a computer and
expecting a proportional improvement in
overall performance
T_improved = T_affected / improvement factor + T_unaffected

 Example: multiply accounts for 80s/100s
 How much improvement in multiply performance to get 5× overall?

 20 = 80/n + 20  →  Can’t be done!
 Corollary: make the common case fast
Chapter 1 — Computer Abstractions and Technology — 31
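Amdahl’s Law as a small C function, using the 80 s / 100 s split from the slide (a sketch, not part of the original slides):

```c
#include <stdio.h>

/* T_improved = T_affected / improvement_factor + T_unaffected */
double amdahl(double t_affected, double t_unaffected, double factor) {
    return t_affected / factor + t_unaffected;
}

int main(void) {
    double t_mul = 80.0, t_rest = 20.0;   /* multiply takes 80s of a 100s run */
    for (double n = 2.0; n <= 64.0; n *= 2.0)
        printf("speed up multiply %2.0fx -> total %.1fs\n",
               n, amdahl(t_mul, t_rest, n));
    /* The total approaches but never reaches 20s, so a 5x overall
       improvement (a 20s total) cannot be achieved. */
    return 0;
}
```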
Pitfall: MIPS as a Performance Metric
 MIPS: Millions of Instructions Per Second
 Doesn’t account for
 Differences in ISAs between computers
 Differences in complexity between instructions

MIPS = Instruction count / (Execution time × 10^6)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10^6)
     = Clock rate / (CPI × 10^6)
 CPI varies between programs on a given CPU
Chapter 1 — Computer Abstractions and Technology — 32
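A short C illustration of why MIPS can mislead: two hypothetical programs on the same 4 GHz CPU, where the program with the higher MIPS rating actually takes longer (all numbers are invented):

```c
#include <stdio.h>

int main(void) {
    double clock_rate = 4e9;       /* 4 GHz */

    /* Program 1: many cheap instructions; Program 2: fewer, costlier ones */
    double ic[2]  = {8e9, 2e9};    /* instruction counts (hypothetical) */
    double cpi[2] = {1.0, 2.5};    /* average CPI (hypothetical) */

    for (int p = 0; p < 2; p++) {
        double time = ic[p] * cpi[p] / clock_rate;
        double mips = ic[p] / (time * 1e6);      /* = clock_rate / (CPI * 1e6) */
        printf("Program %d: time = %.2fs, MIPS = %.0f\n", p + 1, time, mips);
    }
    /* Program 1 reports 4000 MIPS but runs 2.00s; Program 2 reports
       1600 MIPS but runs 1.25s, so the higher-MIPS program is slower. */
    return 0;
}
```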
§1.9 Concluding Remarks
Concluding Remarks
 Cost/performance is improving
 Due to underlying technology development
 Hierarchical layers of abstraction
 In both hardware and software
 Instruction set architecture
 The hardware/software interface
 Execution time: the best performance
measure
 Power is a limiting factor
 Use parallelism to improve performance
Chapter 1 — Computer Abstractions and Technology — 33
Chapter 2
Instructions: Language
of the Computer
HW#1:
1.3 all, 1.4 all, 1.6.1, 1.14.4, 1.14.5, 1.14.6, 1.15.1, and 1.15.4
Due date: one week.

Practice:
1.5 all, 1.6 all, 1.10 all, 1.11 all, 1.14 all, and 1.15 all
§2.1 Introduction
Instruction Set
 The repertoire of instructions of a
computer
 Different computers have different
instruction sets
 But with many aspects in common
 Early computers had very simple
instruction sets
 Simplified implementation
 Many modern computers also have simple
instruction sets

Dr. Yahya Tashtoush


Two Key Principles of Machine Design
1. Instructions are represented as numbers and, as such, are indistinguishable from data
2. Programs are stored in alterable memory (that can be read or written to) just like data

 Stored-program concept
 Programs can be shipped as files of binary numbers – binary compatibility
 Computers can inherit ready-made software provided they are compatible with an existing ISA – leads industry to align around a small number of ISAs

(Figure: memory holding an accounting program (machine code), a C compiler (machine code), payroll data, and source code in C for the accounting program)

Dr. Yahya Tashtoush


MIPS-32 ISA
 Instruction Categories
 Computational
 Load/Store
 Jump and Branch
 Floating Point (coprocessor)
 Memory Management
 Special

 Registers: R0 - R31, plus PC, HI, and LO

 3 Instruction Formats: all 32 bits wide

 op rs rt rd sa funct      R format
 op rs rt immediate        I format
 op jump target            J format


Dr. Yahya Tashtoush
MIPS (RISC) Design Principles
 Simplicity favors regularity
 fixed size instructions
 small number of instruction formats
 opcode always the first 6 bits
 Smaller is faster
 limited instruction set
 limited number of registers in register file
 limited number of addressing modes
 Make the common case fast
 arithmetic operands from the register file (load-store
machine)
 allow instructions to contain immediate operands
 Good design demands good compromises
 three instruction formats
Dr. Yahya Tashtoush
Instructions
 MIPS assembly language arithmetic statement
add $t0, $s1, $s2
sub $t0, $s1, $s2

 Each arithmetic instruction performs one operation


 Each specifies exactly three operands that are all
contained in the datapath’s register file ($t0,$s1,$s2)
destination  source1 op source2

 Instruction Format (R format)

0 17 18 8 0 0x22

Dr. Yahya Tashtoush


MIPS Instruction Fields
 MIPS fields are given names to make
them easier to refer to
op rs rt rd shamt funct

op 6-bits opcode that specifies the operation


rs 5-bits register file address of the first source operand
rt 5-bits register file address of the second source operand
rd 5-bits register file address of the result’s destination
shamt 5-bits shift amount (for shift instructions)
funct 6-bits function code augmenting the opcode

Dr. Yahya Tashtoush


MIPS Register File
 Holds thirty-two 32-bit registers
 Two read ports and one write port

(Figure: register file with two 5-bit source-register addresses, a 5-bit destination-register address, 32-bit write data and a write control input, and two 32-bit read-data outputs)

 Registers are
 Faster than main memory
 - But register files with more locations are slower (e.g., a 64-word file could be as much as 50% slower than a 32-word file)
 - Read/write port increase impacts speed quadratically
 Easier for a compiler to use
 - e.g., (A*B) – (C*D) – (E*F) can do multiplies in any order vs. a stack
 Can hold variables so that
 - code density improves (since registers are named with fewer bits than a memory location)

Dr. Yahya Tashtoush


Convention
Name        Register Number  Usage                   Preserve on call?
$zero       0                constant 0 (hardware)   n.a.
$at         1                reserved for assembler  n.a.
$v0 - $v1   2-3              returned values         no
$a0 - $a3   4-7              arguments               yes
$t0 - $t7   8-15             temporaries             no
$s0 - $s7   16-23            saved values            yes
$t8 - $t9   24-25            temporaries             no
$gp         28               global pointer          yes
$sp         29               stack pointer           yes
$fp         30               frame pointer           yes
$ra         31               return addr (hardware)  yes
§2.2 Operations of the Computer Hardware
Arithmetic Operations
 Add and subtract, three operands
 Two sources and one destination
add a, b, c # a gets b + c
 All arithmetic operations have this form
 Design Principle 1: Simplicity favours
regularity
 Regularity makes implementation simpler
 Simplicity enables higher performance at
lower cost

Dr. Yahya Tashtoush


Arithmetic Example
 C code:
f = (g + h) - (i + j);

 Compiled MIPS code:


add t0, g, h # temp t0 = g + h
add t1, i, j # temp t1 = i + j
sub f, t0, t1 # f = t0 - t1

Dr. Yahya Tashtoush


§2.3 Operands of the Computer Hardware
Register Operands
 Arithmetic instructions use register
operands
 MIPS has a 32 × 32-bit register file
 Use for frequently accessed data
 Numbered 0 to 31
 32-bit data called a “word”
 Assembler names
 $t0, $t1, …, $t9 for temporary values
 $s0, $s1, …, $s7 for saved variables
 Design Principle 2: Smaller is faster
 c.f. main memory: millions of locations

Dr. Yahya Tashtoush


Register Operand Example
 C code:
f = (g + h) - (i + j);
 f, …, j in $s0, …, $s4

 Compiled MIPS code:


add $t0, $s1, $s2
add $t1, $s3, $s4
sub $s0, $t0, $t1

Dr. Yahya Tashtoush


Memory Operands
 Main memory used for composite data
 Arrays, structures, dynamic data
 To apply arithmetic operations
 Load values from memory into registers
 Store result from register to memory
 Memory is byte addressed
 Each address identifies an 8-bit byte
 Words are aligned in memory
 Address must be a multiple of 4
 MIPS is Big Endian
 Most-significant byte at least address of a word
 c.f. Little Endian: least-significant byte at least address

Dr. Yahya Tashtoush


Machine Language - Load Instruction
 Load/Store Instruction Format (I format):
 lw $t0, 24($s3)

 Encoding: op = 35, rs = 19 ($s3), rt = 8 ($t0), immediate = 24 (decimal)

 Effective address: offset 24 (… 0001 1000) + $s3 (… 1001 0100 1010 1100 = 0x12004094) = 0x120040ac
 The memory word at 0x120040ac is loaded into $t0

(Figure: byte-addressed memory with word addresses (hex) from 0x00000000 up to 0xffffffff; the data word at 0x120040ac is loaded into $t0)

Dr. Yahya Tashtoush
Byte Addresses
 Since 8-bit bytes are so useful, most architectures
address individual bytes in memory
 Alignment restriction - the memory address of a word must be on natural word boundaries (a multiple of 4 in MIPS-32)
 Big Endian: leftmost byte is word address
 IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA
 Little Endian: rightmost byte is word address
 Intel 80x86, DEC Vax, DEC Alpha (Windows NT)

(Figure: byte numbering within a word - little endian labels the bytes 3 2 1 0 from msb to lsb, so byte 0 is the rightmost byte; big endian labels them 0 1 2 3, so byte 0 is the leftmost byte)
Dr. Yahya Tashtoush
Memory Operand Example 1
 C code:
g = h + A[8];
 g in $s1, h in $s2, base address of A in $s3

 Compiled MIPS code:


 Index 8 requires offset of 32
 4 bytes per word
lw $t0, 32($s3) # load word (offset 32, base register $s3)
add $s1, $s2, $t0

Dr. Yahya Tashtoush


Memory Operand Example 2
 C code:
A[12] = h + A[8];
 h in $s2, base address of A in $s3

 Compiled MIPS code:


 Index 8 requires offset of 32
lw $t0, 32($s3) # load word
add $t0, $s2, $t0
sw $t0, 48($s3) # store word

Dr. Yahya Tashtoush


Registers vs. Memory
 Registers are faster to access than
memory
 Operating on memory data requires loads
and stores
 More instructions to be executed
 Compiler must use registers for variables
as much as possible
 Only spill to memory for less frequently used
variables
 Register optimization is important!

Dr. Yahya Tashtoush


Immediate Operands
 Constant data specified in an instruction
addi $s3, $s3, 4
 No subtract immediate instruction
 Just use a negative constant
addi $s2, $s1, -1
 Design Principle 3: Make the common
case fast
 Small constants are common
 Immediate operand avoids a load instruction

Dr. Yahya Tashtoush


The Constant Zero
 MIPS register 0 ($zero) is the constant 0
 Cannot be overwritten
 Useful for common operations
 E.g., move between registers
add $t2, $s1, $zero

Dr. Yahya Tashtoush


Review: Unsigned Binary Representation

Hex         Binary   Decimal
0x00000000  0…0000   0
0x00000001  0…0001   1
0x00000002  0…0010   2
0x00000003  0…0011   3
0x00000004  0…0100   4
0x00000005  0…0101   5
0x00000006  0…0110   6
0x00000007  0…0111   7
0x00000008  0…1000   8
0x00000009  0…1001   9
…
0xFFFFFFFC  1…1100   2^32 - 4
0xFFFFFFFD  1…1101   2^32 - 3
0xFFFFFFFE  1…1110   2^32 - 2
0xFFFFFFFF  1…1111   2^32 - 1

Bit weights are 2^31 2^30 2^29 … 2^3 2^2 2^1 2^0 at bit positions 31 30 29 … 3 2 1 0; a word of all 1 bits equals 2^32 - 1 (a 1 followed by 32 zeros, minus 1)
Review: Signed Binary Representation
2’s complement binary   decimal
 -2^3       = 1000      -8
 -(2^3 - 1) = 1001      -7
              1010      -6
              1011      -5
              1100      -4
              1101      -3
              1110      -2
              1111      -1
              0000       0
              0001       1
              0010       2
              0011       3
              0100       4
              0101       5
              0110       6
 2^3 - 1    = 0111       7

To negate, complement all the bits and add 1 (in either direction), e.g., 0101 (+5) → 1010 + 1 = 1011 (-5)
2s-Complement Signed Integers
 Given an n-bit number
n1 n2
x  xn12  x n 2 2    x12  x0 2
1 0

 Range: –2n – 1 to +2n – 1 – 1


 Example
 1111 1111 1111 1111 1111 1111 1111 11002
= –1×231 + 1×230 + … + 1×22 +0×21 +0×20
= –2,147,483,648 + 2,147,483,644 = –410
 Using 32 bits
 –2,147,483,648 to +2,147,483,647

Dr. Yahya Tashtoush


2s-Complement Signed Integers
 Bit 31 is sign bit
 1 for negative numbers
 0 for non-negative numbers
 –(–2^(n–1)) can’t be represented
 Non-negative numbers have the same unsigned
and 2s-complement representation
 Some specific numbers
 0: 0000 0000 … 0000
 –1: 1111 1111 … 1111
 Most-negative: 1000 0000 … 0000
 Most-positive: 0111 1111 … 1111

Dr. Yahya Tashtoush


Signed Negation
 Complement and add 1
 Complement means 1 → 0, 0 → 1

x  x  1111...1112  1

x  1  x

 Example: negate +2
 +2 = 0000 0000 … 00102
 –2 = 1111 1111 … 11012 + 1
= 1111 1111 … 11102

Dr. Yahya Tashtoush


Sign Extension
 Representing a number using more bits
 Preserve the numeric value
 In MIPS instruction set
 addi: extend immediate value
 lb, lh: extend loaded byte/halfword
 beq, bne: extend the displacement
 Replicate the sign bit to the left
 c.f. unsigned values: extend with 0s
 Examples: 8-bit to 16-bit
 +2: 0000 0010 => 0000 0000 0000 0010
 –2: 1111 1110 => 1111 1111 1111 1110

Dr. Yahya Tashtoush
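A small C sketch of the two 8-bit to 16-bit examples above: sign extension replicates the sign bit, while unsigned (zero) extension fills with 0s:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int8_t pos = 2, neg = -2;

    /* Sign extension: the cast replicates bit 7 into bits 15..8 */
    int16_t pos_ext = (int16_t)pos;   /* 0000 0010 -> 0000 0000 0000 0010 */
    int16_t neg_ext = (int16_t)neg;   /* 1111 1110 -> 1111 1111 1111 1110 */

    /* Zero extension: an unsigned source just fills the new bits with 0 */
    uint16_t zext = (uint16_t)(uint8_t)neg;  /* 1111 1110 -> 0000 0000 1111 1110 */

    printf("+2 sign-extended:   0x%04X\n", (uint16_t)pos_ext);  /* 0x0002 */
    printf("-2 sign-extended:   0x%04X\n", (uint16_t)neg_ext);  /* 0xFFFE */
    printf("0xFE zero-extended: 0x%04X\n", zext);               /* 0x00FE */
    return 0;
}
```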


§2.5 Representing Instructions in the Computer
Representing Instructions
 Instructions are encoded in binary
 Called machine code
 MIPS instructions
 Encoded as 32-bit instruction words
 Small number of formats encoding operation code
(opcode), register numbers, …
 Regularity!
 Register numbers
 $t0 – $t7 are reg’s 8 – 15
 $t8 – $t9 are reg’s 24 – 25
 $s0 – $s7 are reg’s 16 – 23

Dr. Yahya Tashtoush


MIPS R-format Instructions
op rs rt rd shamt funct
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits

 Instruction fields
 op: operation code (opcode)
 rs: first source register number
 rt: second source register number
 rd: destination register number
 shamt: shift amount (00000 for now)
 funct: function code (extends opcode)

Dr. Yahya Tashtoush


R-format Example
op rs rt rd shamt funct
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits

add $t0, $s1, $s2


special $s1 $s2 $t0 0 add

0 17 18 8 0 32

000000 10001 10010 01000 00000 100000

0000 0010 0011 0010 0100 0000 0010 0000 (binary) = 0x02324020

Dr. Yahya Tashtoush
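The same encoding done in C: packing the six R-format fields of add $t0, $s1, $s2 into one 32-bit word (a sketch; the field values come from the slide):

```c
#include <stdio.h>
#include <stdint.h>

/* Pack R-format fields: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6) */
uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt,
                  uint32_t rd, uint32_t shamt, uint32_t funct) {
    return (op << 26) | (rs << 21) | (rt << 16)
         | (rd << 11) | (shamt << 6) | funct;
}

int main(void) {
    /* add $t0, $s1, $s2 -> op=0, rs=17 ($s1), rt=18 ($s2), rd=8 ($t0), shamt=0, funct=32 */
    uint32_t word = encode_r(0, 17, 18, 8, 0, 32);
    printf("0x%08X\n", word);   /* prints 0x02324020 */
    return 0;
}
```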


Hexadecimal
 Base 16
 Compact representation of bit strings
 4 bits per hex digit

0 0000 4 0100 8 1000 c 1100


1 0001 5 0101 9 1001 d 1101
2 0010 6 0110 a 1010 e 1110
3 0011 7 0111 b 1011 f 1111

 Example: eca8 6420


 1110 1100 1010 1000 0110 0100 0010 0000

Dr. Yahya Tashtoush


MIPS I-format Instructions
op rs rt constant or address
6 bits 5 bits 5 bits 16 bits

 Immediate arithmetic and load/store instructions


 rt: destination or source register number
 Constant: –2^15 to +2^15 – 1
 Address: offset added to base address in rs
 Design Principle 4: Good design demands good
compromises
 Different formats complicate decoding, but allow 32-bit
instructions uniformly
 Keep formats as similar as possible

Dr. Yahya Tashtoush


Stored Program Computers
The BIG Picture  Instructions represented in
binary, just like data
 Instructions and data stored
in memory
 Programs can operate on
programs
 e.g., compilers, linkers, …
 Binary compatibility allows
compiled programs to work
on different computers
 Standardized ISAs

Dr. Yahya Tashtoush


§2.6 Logical Operations
Logical Operations
 Instructions for bitwise manipulation
Operation C Java MIPS
Shift left << << sll
Shift right >> >>> srl
Bitwise AND & & and, andi
Bitwise OR | | or, ori
Bitwise NOT ~ ~ nor

 Useful for extracting and inserting


groups of bits in a word
Dr. Yahya Tashtoush
Shift Operations
op rs rt rd shamt funct
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits

 shamt: how many positions to shift


 Shift left logical
 Shift left and fill with 0 bits
 sll by i bits multiplies by 2^i
 Shift right logical
 Shift right and fill with 0 bits
 srl by i bits divides by 2^i (unsigned only)

Dr. Yahya Tashtoush


AND Operations
 Useful to mask bits in a word
 Select some bits, clear others to 0
and $t0, $t1, $t2

$t2 0000 0000 0000 0000 0000 1101 1100 0000

$t1 0000 0000 0000 0000 0011 1100 0000 0000

$t0 0000 0000 0000 0000 0000 1100 0000 0000

Dr. Yahya Tashtoush


OR Operations
 Useful to include bits in a word
 Set some bits to 1, leave others unchanged
or $t0, $t1, $t2

$t2 0000 0000 0000 0000 0000 1101 1100 0000

$t1 0000 0000 0000 0000 0011 1100 0000 0000

$t0 0000 0000 0000 0000 0011 1101 1100 0000

Dr. Yahya Tashtoush


NOT Operations
 Useful to invert bits in a word
 Change 0 to 1, and 1 to 0
 MIPS has NOR 3-operand instruction
 a NOR b == NOT ( a OR b )
nor $t0, $t1, $zero   # register 0 always reads as zero

$t1 0000 0000 0000 0000 0011 1100 0000 0000

$t0 1111 1111 1111 1111 1100 0011 1111 1111

Dr. Yahya Tashtoush
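A C sketch of the three bit-level examples above, using the same operand values; MIPS builds NOT from NOR with $zero, and since C has no NOR operator the complement of (a | 0) is used instead:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t t1 = 0x00003C00;   /* 0000 ... 0011 1100 0000 0000 */
    uint32_t t2 = 0x00000DC0;   /* 0000 ... 0000 1101 1100 0000 */

    uint32_t and_r = t1 & t2;   /* mask bits:    0x00000C00 */
    uint32_t or_r  = t1 | t2;   /* include bits: 0x00003DC0 */
    uint32_t nor_r = ~(t1 | 0); /* NOT via NOR with zero: 0xFFFFC3FF */

    printf("and: 0x%08X\nor:  0x%08X\nnor: 0x%08X\n", and_r, or_r, nor_r);
    return 0;
}
```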


§2.7 Instructions for Making Decisions
Conditional Operations
 Branch to a labeled instruction if a
condition is true
 Otherwise, continue sequentially
 beq rs, rt, L1
 if (rs == rt) branch to instruction labeled L1;
 bne rs, rt, L1
 if (rs != rt) branch to instruction labeled L1;
 j L1
 unconditional jump to instruction labeled L1

Dr. Yahya Tashtoush


Specifying Branch Destinations
 Use a register (like in lw and sw) added to the 16-bit offset
 which register? Instruction Address Register (the PC)
 its use is automatically implied by instruction
 PC gets updated (PC+4) during the fetch cycle so that it holds the
address of the next instruction
 limits the branch distance to –2^15 to +2^15 – 1 (word) instructions from
the (instruction after the) branch instruction, but most branches are
local anyway

(Figure: the 16-bit offset from the low-order bits of the branch instruction is sign-extended, shifted left 2 bits (appending 00), and added to PC+4 to form the branch destination address)

Dr. Yahya Tashtoush


Compiling If Statements
 C code:
if (i==j) f = g+h;
else f = g-h;
 f, g, … in $s0, $s1, …
 Compiled MIPS code:
bne $s3, $s4, Else
add $s0, $s1, $s2
j Exit
Else: sub $s0, $s1, $s2
Exit: …
Assembler calculates addresses

Dr. Yahya Tashtoush


Compiling Loop Statements
 C code:
while (save[i] == k) i += 1;
 i in $s3, k in $s5, address of save in $s6
 Compiled MIPS code:
Loop: sll $t1, $s3, 2
add $t1, $t1, $s6
lw $t0, 0($t1)
bne $t0, $s5, Exit
addi $s3, $s3, 1
j Loop
Exit: …

Dr. Yahya Tashtoush


More Conditional Operations
 Set result to 1 if a condition is true
 Otherwise, set to 0
 slt rd, rs, rt
 if (rs < rt) rd = 1; else rd = 0;
 slti rt, rs, constant
 if (rs < constant) rt = 1; else rt = 0;
 Use in combination with beq, bne
slt $t0, $s1, $s2 # if ($s1 < $s2)
bne $t0, $zero, L # branch to L

Dr. Yahya Tashtoush


Branch Instruction Design
 Why not blt, bge, etc?
 Hardware for <, ≥, … slower than =, ≠
 Combining with branch involves more work
per instruction, requiring a slower clock
 All instructions penalized!
 beq and bne are the common case
 This is a good design compromise

Dr. Yahya Tashtoush


Signed vs. Unsigned
 Signed comparison: slt, slti
 Unsigned comparison: sltu, sltui
 Example
 $s0 = 1111 1111 1111 1111 1111 1111 1111 1111
 $s1 = 0000 0000 0000 0000 0000 0000 0000 0001
 slt $t0, $s0, $s1 # signed
 –1 < +1  $t0 = 1
 sltu $t0, $s0, $s1 # unsigned
 +4,294,967,295 > +1  $t0 = 0

Dr. Yahya Tashtoush


Register Usage
 $a0 – $a3: arguments (reg’s 4 – 7)
 $v0, $v1: result values (reg’s 2 and 3)
 $t0 – $t9: temporaries
 Can be overwritten by callee
 $s0 – $s7: saved
 Must be saved/restored by callee
 $gp: global pointer for static data (reg 28)
 $sp: stack pointer (reg 29)
 $fp: frame pointer (reg 30)
 $ra: return address (reg 31)

Dr. Yahya Tashtoush


Procedure Call Instructions
 Procedure call: jump and link
jal ProcedureLabel
 Address of following instruction put in $ra

 Jumps to target address

 Procedure return: jump register


jr $ra
 Copies $ra to program counter

 Can also be used for computed jumps

 e.g., for case/switch statements

Dr. Yahya Tashtoush


§2.10 MIPS Addressing for 32-Bit Immediates and Addresses
32-bit Constants
 Most constants are small
 16-bit immediate is sufficient
 For the occasional 32-bit constant
lui rt, constant
 Copies 16-bit constant to left 16 bits of rt
 Clears right 16 bits of rt to 0

lui $s0, 61        0000 0000 0011 1101 0000 0000 0000 0000

ori $s0, $s0, 2304 0000 0000 0011 1101 0000 1001 0000 0000

Dr. Yahya Tashtoush


Branch Addressing
 Branch instructions specify
 Opcode, two registers, target address
 Most branch targets are near branch
 Forward or backward

op rs rt constant or address
6 bits 5 bits 5 bits 16 bits

 PC-relative addressing
 Target address = PC + offset × 4
 PC already incremented by 4 by this time
Dr. Yahya Tashtoush
Jump Control Flow Instructions
 MIPS also has an unconditional branch instruction or
jump instruction:
j label #go to label

 Instruction Format (J Format):


0x02 26-bit address

(Figure: the 26-bit target from the low-order bits of the jump instruction is shifted left 2 bits (appending 00) and combined with the upper bits of PC+4 to form the 32-bit jump address)

Dr. Yahya Tashtoush


Target Addressing Example
 Loop code from earlier example
 Assume Loop at location 80000

Loop: sll  $t1, $s3, 2      80000   0   0  19   9   4   0
      add  $t1, $t1, $s6    80004   0   9  22   9   0  32
      lw   $t0, 0($t1)      80008  35   9   8       0
      bne  $t0, $s5, Exit   80012   5   8  21       2
      addi $s3, $s3, 1      80016   8  19  19       1
      j    Loop             80020   2           20000
Exit: …                     80024

Dr. Yahya Tashtoush
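A C check of the two target-address fields in this example: the bne offset is counted in words from the instruction after the branch, and the j field holds the word address of Loop (a sketch using the addresses from the slide):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t bne_addr  = 80012;   /* address of the bne       */
    uint32_t exit_addr = 80024;   /* address of the Exit label */
    uint32_t loop_addr = 80000;   /* address of the Loop label */

    /* PC-relative: offset in words from the instruction after the branch */
    int32_t bne_offset = (int32_t)(exit_addr - (bne_addr + 4)) / 4;
    printf("bne offset field = %d\n", bne_offset);   /* 2, as in the slide */

    /* Pseudodirect: jump field holds the word address of the target */
    uint32_t j_field = loop_addr / 4;
    printf("j target field   = %u\n", j_field);      /* 20000, as in the slide */
    return 0;
}
```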


Branching Far Away
 If branch target is too far to encode with
16-bit offset, assembler rewrites the code
 Example
beq $s0,$s1, L1

bne $s0,$s1, L2
j L1
L2: …

Dr. Yahya Tashtoush


Addressing Mode Summary

Dr. Yahya Tashtoush


§2.12 Translating and Starting a Program
Translation and Startup

(Figure: translation hierarchy from C source through compiler, assembler, linker, and loader; many compilers produce object modules directly, and static linking combines object modules before execution)

Dr. Yahya Tashtoush


Assembler Pseudoinstructions
 Most assembler instructions represent
machine instructions one-to-one
 Pseudoinstructions: figments of the
assembler’s imagination
move $t0, $t1 → add $t0, $zero, $t1
blt $t0, $t1, L → slt $at, $t0, $t1
bne $at, $zero, L

Dr. Yahya Tashtoush


§2.13 A C Sort Example to Put It All Together
C Sort Example
 Illustrates use of assembly instructions
for a C bubble sort function
 Swap procedure (leaf)
void swap(int v[], int k)
{
int temp;
temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;
}
 v in $a0, k in $a1, temp in $t0

Dr. Yahya Tashtoush


The Procedure Swap
swap: sll $t1, $a1, 2 # $t1 = k * 4
add $t1, $a0, $t1 # $t1 = v+(k*4)
# (address of v[k])
lw $t0, 0($t1) # $t0 (temp) = v[k]
lw $t2, 4($t1) # $t2 = v[k+1]
sw $t2, 0($t1) # v[k] = $t2 (v[k+1])
sw $t0, 4($t1) # v[k+1] = $t0 (temp)
jr $ra # return to calling routine

Dr. Yahya Tashtoush


§2.17 Real Stuff: x86 Instructions
The Intel x86 ISA
 Evolution with backward compatibility
 8080 (1974): 8-bit microprocessor
 Accumulator, plus 3 index-register pairs
 8086 (1978): 16-bit extension to 8080
 Complex instruction set (CISC)
 8087 (1980): floating-point coprocessor
 Adds FP instructions and register stack
 80286 (1982): 24-bit addresses, MMU
 Segmented memory mapping and protection
 80386 (1985): 32-bit extension (now IA-32)
 Additional addressing modes and operations
 Paged memory mapping as well as segments

Dr. Yahya Tashtoush


The Intel x86 ISA
 Further evolution…
 i486 (1989): pipelined, on-chip caches and FPU
 Compatible competitors: AMD, Cyrix, …
 Pentium (1993): superscalar, 64-bit datapath
 Later versions added MMX (Multi-Media eXtension)
instructions
 The infamous FDIV bug
 Pentium Pro (1995), Pentium II (1997)
 New microarchitecture (see Colwell, The Pentium Chronicles)
 Pentium III (1999)
 Added SSE (Streaming SIMD Extensions) and associated
registers
 Pentium 4 (2001)
 New microarchitecture
 Added SSE2 instructions

Dr. Yahya Tashtoush


The Intel x86 ISA
 And further…
 AMD64 (2003): extended architecture to 64 bits
 EM64T – Extended Memory 64 Technology (2004)
 AMD64 adopted by Intel (with refinements)
 Added SSE3 instructions
 Intel Core (2006)
 Added SSE4 instructions, virtual machine support
 AMD64 (announced 2007): SSE5 instructions
 Intel declined to follow, instead…
 Advanced Vector Extension (announced 2008)
 Longer SSE registers, more instructions
 If Intel didn’t extend with compatibility, its
competitors would!
 Technical elegance ≠ market success

Dr. Yahya Tashtoush


Basic x86 Registers

Dr. Yahya Tashtoush


x86 Instruction Encoding
 Variable length
encoding
 Postfix bytes specify
addressing mode
 Prefix bytes modify
operation
 Operand length,
repetition, locking, …

Dr. Yahya Tashtoush


Pitfalls
 Sequential words are not at sequential
addresses
 Increment by 4, not by 1!

Dr. Yahya Tashtoush


§2.19 Concluding Remarks
Concluding Remarks
 Design principles
1. Simplicity favors regularity
2. Smaller is faster
3. Make the common case fast
4. Good design demands good compromises

Dr. Yahya Tashtoush


Concluding Remarks
 Measure MIPS instruction executions in
benchmark programs
 Consider making the common case fast
 Consider compromises
Instruction class  MIPS examples                      SPEC2006 Int  SPEC2006 FP
Arithmetic         add, sub, addi                     16%           48%
Data transfer      lw, sw, lb, lbu, lh, lhu, sb, lui  35%           36%
Logical            and, or, nor, andi, ori, sll, srl  12%           4%
Cond. branch       beq, bne, slt, slti, sltiu         34%           8%
Jump               j, jr, jal                         2%            0%

Dr. Yahya Tashtoush


MIPS Organization So Far
(Figure: MIPS organization so far. The processor contains a register file (two 5-bit source-register addresses, one 5-bit destination address, 32-bit write data, and two 32-bit read ports over the 32 registers $zero - $ra), a fetch unit that updates PC = PC + 4, and an ALU. It connects to a byte-addressed, big-endian memory of 2^30 32-bit words with read/write data ports; fetch, decode, and execute stages are shown.)

Dr. Yahya Tashtoush


Chapter 4
The Processor
§4.1 Introduction
Introduction
 CPU performance factors
 Instruction count
 Determined by ISA and compiler
 CPI and Cycle time
 Determined by CPU hardware
 We will examine two MIPS implementations
 A simplified version
 A more realistic pipelined version
 Simple subset, shows most aspects
 Memory reference: lw, sw
 Arithmetic/logical: add, sub, and, or, slt
 Control transfer: beq, j

Chapter 4 — The Processor — 2


Instruction Execution
 PC  instruction memory, fetch instruction
 Register numbers  register file, read registers
 Depending on instruction class
 Use ALU to calculate
 Arithmetic result
 Memory address for load/store
 Branch target address
 Access data memory for load/store
 PC  target address or PC + 4

Chapter 4 — The Processor — 3


CPU Overview

Chapter 4 — The Processor — 4


Multiplexers
 Can’t just join
wires together
 Use multiplexers

Chapter 4 — The Processor — 5


Control

Chapter 4 — The Processor — 6


§4.2 Logic Design Conventions
Logic Design Basics
 Information encoded in binary
 Low voltage = 0, High voltage = 1
 One wire per bit
 Multi-bit data encoded on multi-wire buses
 Combinational element
 Operate on data
 Output is a function of input
 State (sequential) elements
 Store information

Chapter 4 — The Processor — 7


Combinational Elements
 AND-gate: Y = A & B
 Adder: Y = A + B
 Multiplexer: Y = S ? I1 : I0
 Arithmetic/Logic Unit: Y = F(A, B)

(Figure: logic symbols for the AND gate, adder, multiplexer, and ALU)
Chapter 4 — The Processor — 8


Sequential Elements
 Register: stores data in a circuit
 Uses a clock signal to determine when to
update the stored value
 Edge-triggered: update when Clk changes
from 0 to 1

(Figure: D flip-flop with data input D, clock input Clk, and output Q; Q takes the value of D on the rising clock edge)

Chapter 4 — The Processor — 9


Sequential Elements
 Register with write control
 Only updates on clock edge when write
control input is 1
 Used when stored value is required later

(Figure: D flip-flop with write control; the stored value updates from D on a clock edge only when Write is 1)

Chapter 4 — The Processor — 10


Clocking Methodology
 Combinational logic transforms data during
clock cycles
 Between clock edges
 Input from state elements, output to state
element
 Longest delay determines clock period

Chapter 4 — The Processor — 11


§4.3 Building a Datapath
Building a Datapath
 Datapath
 Elements that process data and addresses
in the CPU
 Registers, ALUs, mux’s, memories, …
 We will build a MIPS datapath
incrementally
 Refining the overview design

Chapter 4 — The Processor — 12


Instruction Fetch

(Figure: instruction fetch: the PC, a 32-bit register, supplies the address to instruction memory, and an adder increments the PC by 4 to point at the next instruction)

Chapter 4 — The Processor — 13


R-Format Instructions
 Read two register operands
 Perform arithmetic/logical operation
 Write register result

Chapter 4 — The Processor — 14


Load/Store Instructions
 Read register operands
 Calculate address using 16-bit offset
 Use ALU, but sign-extend offset
 Load: Read memory and update register
 Store: Write register value to memory

Chapter 4 — The Processor — 15


Branch Instructions
 Read register operands
 Compare operands
 Use ALU, subtract and check Zero output
 Calculate target address
 Sign-extend displacement
 Shift left 2 places (word displacement)
 Add to PC + 4
 Already calculated by instruction fetch

Chapter 4 — The Processor — 16


Branch Instructions
(Figure: branch datapath: the shift-left-2 unit just re-routes wires, and sign extension replicates the sign-bit wire)

Chapter 4 — The Processor — 17


Composing the Elements
 First-cut data path does an instruction in
one clock cycle
 Each datapath element can only do one
function at a time
 Hence, we need separate instruction and data
memories
 Use multiplexers where alternate data
sources are used for different instructions

Chapter 4 — The Processor — 18


R-Type/Load/Store Datapath

Chapter 4 — The Processor — 19


Full Datapath

Chapter 4 — The Processor — 20


§4.4 A Simple Implementation Scheme
ALU Control
 ALU used for
 Load/Store: F = add
 Branch: F = subtract
 R-type: F depends on funct field
ALU control Function
0000 AND
0001 OR
0010 add
0110 subtract
0111 set-on-less-than
1100 NOR

Chapter 4 — The Processor — 21


ALU Control
 Assume 2-bit ALUOp derived from opcode
 Combinational logic derives ALU control

opcode ALUOp Operation funct ALU function ALU control


lw 00 load word XXXXXX add 0010
sw 00 store word XXXXXX add 0010
beq 01 branch equal XXXXXX subtract 0110
R-type 10 add 100000 add 0010
subtract 100010 subtract 0110
AND 100100 AND 0000
OR 100101 OR 0001
set-on-less-than 101010 set-on-less-than 0111

Chapter 4 — The Processor — 22


The Main Control Unit
 Control signals derived from instruction

R-type      0         rs     rt     rd     shamt  funct
            31:26     25:21  20:16  15:11  10:6   5:0

Load/Store  35 or 43  rs     rt     address
            31:26     25:21  20:16  15:0

Branch      4         rs     rt     address
            31:26     25:21  20:16  15:0

 Bits 31:26 are the opcode; rs is always read; rt is read, except for load;
the destination field is written for R-type and load; the 16-bit address is
sign-extended and added
Chapter 4 — The Processor — 23


Datapath With Control

Chapter 4 — The Processor — 24


R-Type Instruction

Chapter 4 — The Processor — 25


Load Instruction

Chapter 4 — The Processor — 26


Branch-on-Equal Instruction

Chapter 4 — The Processor — 27


Implementing Jumps
Jump 2 address
31:26 25:0

 Jump uses word address


 Update PC with concatenation of
 Top 4 bits of old PC
 26-bit jump address
 00
 Need an extra control signal decoded from
opcode
Chapter 4 — The Processor — 28
Datapath With Jumps Added

Chapter 4 — The Processor — 29


Performance Issues
 Longest delay determines clock period
 Critical path: load instruction
 Instruction memory  register file  ALU 
data memory  register file
 Not feasible to vary period for different
instructions
 Violates design principle
 Making the common case fast
 We will improve performance by pipelining

Chapter 4 — The Processor — 30


§4.5 An Overview of Pipelining
Pipelining Analogy
 Pipelined laundry: overlapping execution
 Parallelism improves performance

 Four loads:
 Speedup = 8/3.5 = 2.3
 Non-stop:
 Speedup = 2n / (0.5n + 1.5) ≈ 4 = number of stages

Chapter 4 — The Processor — 31


MIPS Pipeline
 Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

Chapter 4 — The Processor — 32


Pipeline Performance
 Assume time for stages is
 100ps for register read or write
 200ps for other stages
 Compare pipelined datapath with single-cycle
datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw 200ps 100 ps 200ps 200ps 100 ps 800ps
sw 200ps 100 ps 200ps 200ps 700ps
R-format 200ps 100 ps 200ps 100 ps 600ps
beq 200ps 100 ps 200ps 500ps

Chapter 4 — The Processor — 33


Pipeline Performance
Single-cycle (Tc= 800ps)

Pipelined (Tc= 200ps)

Chapter 4 — The Processor — 34


Pipeline Speedup
 If all stages are balanced
 i.e., all take the same time
 Time between instructions (pipelined)
= Time between instructions (nonpipelined) / Number of stages
 If not balanced, speedup is less
 Speedup due to increased throughput
 Latency (time for each instruction) does not
decrease

Chapter 4 — The Processor — 35
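Using the 800 ps / 200 ps numbers from the previous slides, a quick C check of throughput versus latency (a sketch; like the slide's steady-state formula it ignores fill and drain cycles):

```c
#include <stdio.h>

int main(void) {
    double single_cycle_ps = 800.0;   /* one instruction per 800 ps, unpipelined */
    double stage_ps        = 200.0;   /* pipelined: limited by the slowest stage */
    int    stages          = 5;

    double ideal = single_cycle_ps / stages;   /* 160 ps if all stages were balanced */
    printf("time between instructions: %.0f ps pipelined vs %.0f ps ideal\n",
           stage_ps, ideal);
    printf("speedup = %.1fx (less than %d because stages are unbalanced)\n",
           single_cycle_ps / stage_ps, stages);
    /* Latency of one instruction is still 5 * 200 ps = 1000 ps; it does not decrease. */
    return 0;
}
```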


Pipelining and ISA Design
 MIPS ISA designed for pipelining
 All instructions are 32-bits
 Easier to fetch and decode in one cycle
 c.f. x86: 1- to 17-byte instructions
 Few and regular instruction formats
 Can decode and read registers in one step
 Load/store addressing
 Can calculate address in 3rd stage, access memory
in 4th stage
 Alignment of memory operands
 Memory access takes only one cycle

Chapter 4 — The Processor — 36


Hazards
 Situations that prevent starting the next
instruction in the next cycle
 Structure hazards
 A required resource is busy
 Data hazard
 Need to wait for previous instruction to
complete its data read/write
 Control hazard
 Deciding on control action depends on
previous instruction

Chapter 4 — The Processor — 37


Structure Hazards
 Conflict for use of a resource
 In MIPS pipeline with a single memory
 Load/store requires data access
 Instruction fetch would have to stall for that
cycle
 Would cause a pipeline “bubble”
 Hence, pipelined datapaths require
separate instruction/data memories
 Or separate instruction/data caches

Chapter 4 — The Processor — 38


Data Hazards
 An instruction depends on completion of
data access by a previous instruction
 add $s0, $t0, $t1
sub $t2, $s0, $t3

Chapter 4 — The Processor — 39


Forwarding (aka Bypassing)
 Use result when it is computed
 Don’t wait for it to be stored in a register
 Requires extra connections in the datapath

Chapter 4 — The Processor — 40


Load-Use Data Hazard
 Can’t always avoid stalls by forwarding
 If value not computed when needed
 Can’t forward backward in time!

Chapter 4 — The Processor — 41


Code Scheduling to Avoid Stalls
 Reorder code to avoid use of load result in
the next instruction
 C code for A = B + E; C = B + F;

 Original order (13 cycles, two stalls):
   lw  $t1, 0($t0)
   lw  $t2, 4($t0)
   (stall) add $t3, $t1, $t2
   sw  $t3, 12($t0)
   lw  $t4, 8($t0)
   (stall) add $t5, $t1, $t4
   sw  $t5, 16($t0)

 Reordered (11 cycles, no stalls):
   lw  $t1, 0($t0)
   lw  $t2, 4($t0)
   lw  $t4, 8($t0)
   add $t3, $t1, $t2
   sw  $t3, 12($t0)
   add $t5, $t1, $t4
   sw  $t5, 16($t0)

Chapter 4 — The Processor — 42


Control Hazards
 Branch determines flow of control
 Fetching next instruction depends on branch
outcome
 Pipeline can’t always fetch correct instruction
 Still working on ID stage of branch
 In MIPS pipeline
 Need to compare registers and compute
target early in the pipeline
 Add hardware to do it in ID stage

Chapter 4 — The Processor — 43


Stall on Branch
 Wait until branch outcome determined
before fetching next instruction

Chapter 4 — The Processor — 44


Branch Prediction
 Longer pipelines can’t readily determine
branch outcome early
 Stall penalty becomes unacceptable
 Predict outcome of branch
 Only stall if prediction is wrong
 In MIPS pipeline
 Can predict branches not taken
 Fetch instruction after branch, with no delay

Chapter 4 — The Processor — 45


MIPS with Predict Not Taken

Prediction
correct

Prediction
incorrect

Chapter 4 — The Processor — 46


More-Realistic Branch Prediction
 Static branch prediction
 Based on typical branch behavior
 Example: loop and if-statement branches
 Predict backward branches taken
 Predict forward branches not taken
 Dynamic branch prediction
 Hardware measures actual branch behavior
 e.g., record recent history of each branch
 Assume future behavior will continue the trend
 When wrong, stall while re-fetching, and update history

Chapter 4 — The Processor — 47


Pipeline Summary
The BIG Picture

 Pipelining improves performance by


increasing instruction throughput
 Executes multiple instructions in parallel
 Each instruction has the same latency
 Subject to hazards
 Structure, data, control
 Instruction set design affects complexity of
pipeline implementation
Chapter 4 — The Processor — 48
§4.6 Pipelined Datapath and Control
MIPS Pipelined Datapath

(Figure: pipelined datapath; the right-to-left flow in the MEM and WB stages leads to hazards)

Chapter 4 — The Processor — 49


Pipeline registers
 Need registers between stages
 To hold information produced in previous cycle

Chapter 4 — The Processor — 50


Pipeline Operation
 Cycle-by-cycle flow of instructions through
the pipelined datapath
 “Single-clock-cycle” pipeline diagram
 Shows pipeline usage in a single cycle
 Highlight resources used
 c.f. “multi-clock-cycle” diagram
 Graph of operation over time
 We’ll look at “single-clock-cycle” diagrams
for load & store

Chapter 4 — The Processor — 51


IF for Load, Store, …

Chapter 4 — The Processor — 52


ID for Load, Store, …

Chapter 4 — The Processor — 53


EX for Load

Chapter 4 — The Processor — 54


MEM for Load

Chapter 4 — The Processor — 55


WB for Load

(Figure: without a fix, the WB stage writes using the wrong register number)

Chapter 4 — The Processor — 56


Corrected Datapath for Load

Chapter 4 — The Processor — 57


EX for Store

Chapter 4 — The Processor — 58


MEM for Store

Chapter 4 — The Processor — 59


WB for Store

Chapter 4 — The Processor — 60


Multi-Cycle Pipeline Diagram
 Form showing resource usage

Chapter 4 — The Processor — 61


Multi-Cycle Pipeline Diagram
 Traditional form

Chapter 4 — The Processor — 62


Single-Cycle Pipeline Diagram
 State of pipeline in a given cycle

Chapter 4 — The Processor — 63


Pipelined Control (Simplified)

Chapter 4 — The Processor — 64


Pipelined Control
 Control signals derived from instruction
 As in single-cycle implementation

Chapter 4 — The Processor — 65


Pipelined Control

Chapter 4 — The Processor — 66


§4.7 Data Hazards: Forwarding vs. Stalling
Data Hazards in ALU Instructions
 Consider this sequence:
sub $2, $1,$3
and $12,$2,$5
or $13,$6,$2
add $14,$2,$2
sw $15,100($2)
 We can resolve hazards with forwarding
 How do we detect when to forward?

Chapter 4 — The Processor — 67


Dependencies & Forwarding

Chapter 4 — The Processor — 68


Detecting the Need to Forward
 Pass register numbers along pipeline
 e.g., ID/EX.RegisterRs = register number for Rs
sitting in ID/EX pipeline register
 ALU operand register numbers in EX stage
are given by
 ID/EX.RegisterRs, ID/EX.RegisterRt
 Data hazards when
 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs   (fwd from EX/MEM pipeline reg)
 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt   (fwd from EX/MEM pipeline reg)
 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs   (fwd from MEM/WB pipeline reg)
 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt   (fwd from MEM/WB pipeline reg)

Chapter 4 — The Processor — 69


Detecting the Need to Forward
 But only if forwarding instruction will write
to a register!
 EX/MEM.RegWrite, MEM/WB.RegWrite
 And only if Rd for that instruction is not
$zero
 EX/MEM.RegisterRd ≠ 0,
MEM/WB.RegisterRd ≠ 0

Chapter 4 — The Processor — 70


Forwarding Paths

Chapter 4 — The Processor — 71


Forwarding Conditions
 EX hazard
 if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
ForwardA = 10
 if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
ForwardB = 10
 MEM hazard
 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01

Chapter 4 — The Processor — 72
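A C sketch of that forwarding logic as it might appear in a simple pipeline simulator; the struct and field names are invented for illustration, and the EX-hazard check is applied last so the most recent result wins, which also covers the double-hazard case discussed on the next slides (10 selects the EX/MEM value, 01 the MEM/WB value, 00 the register file):

```c
#include <stdint.h>

/* Hypothetical pipeline-register snapshots for one cycle */
struct ExMem { int reg_write; unsigned rd; };
struct MemWb { int reg_write; unsigned rd; };
struct IdEx  { unsigned rs, rt; };

/* Computes the 2-bit ForwardA/ForwardB selects used by the EX-stage muxes */
void forwarding_unit(struct ExMem em, struct MemWb mw, struct IdEx ie,
                     unsigned *forward_a, unsigned *forward_b) {
    *forward_a = 0; *forward_b = 0;                                    /* 00: register file */

    /* MEM hazard (older result) checked first */
    if (mw.reg_write && mw.rd != 0 && mw.rd == ie.rs) *forward_a = 1;  /* 01 */
    if (mw.reg_write && mw.rd != 0 && mw.rd == ie.rt) *forward_b = 1;  /* 01 */

    /* EX hazard (most recent result) checked last, so it takes priority */
    if (em.reg_write && em.rd != 0 && em.rd == ie.rs) *forward_a = 2;  /* 10 */
    if (em.reg_write && em.rd != 0 && em.rd == ie.rt) *forward_b = 2;  /* 10 */
}
```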


Double Data Hazard
 Consider the sequence:
add $1,$1,$2
add $1,$1,$3
add $1,$1,$4
 Both hazards occur
 Want to use the most recent
 Revise MEM hazard condition
 Only fwd if EX hazard condition isn’t true

Chapter 4 — The Processor — 73


Revised Forwarding Condition
 MEM hazard
 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01

Chapter 4 — The Processor — 74


Datapath with Forwarding

Chapter 4 — The Processor — 75


Load-Use Data Hazard

Need to stall
for one cycle

Chapter 4 — The Processor — 76


Load-Use Hazard Detection
 Check when using instruction is decoded
in ID stage
 ALU operand register numbers in ID stage
are given by
 IF/ID.RegisterRs, IF/ID.RegisterRt
 Load-use hazard when
 ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
 If detected, stall and insert bubble

Chapter 4 — The Processor — 77


How to Stall the Pipeline
 Force control values in ID/EX register
to 0
 EX, MEM and WB do nop (no-operation)
 Prevent update of PC and IF/ID register
 Using instruction is decoded again
 Following instruction is fetched again
 1-cycle stall allows MEM to read data for lw
 Can subsequently forward to EX stage

Chapter 4 — The Processor — 78


Stall/Bubble in the Pipeline

Stall inserted
here

Chapter 4 — The Processor — 79


Stall/Bubble in the Pipeline

Or, more
accurately…
Chapter 4 — The Processor — 80
Datapath with Hazard Detection

Chapter 4 — The Processor — 81


Stalls and Performance
The BIG Picture

 Stalls reduce performance


 But are required to get correct results
 Compiler can arrange code to avoid
hazards and stalls
 Requires knowledge of the pipeline structure

Chapter 4 — The Processor — 82


§4.8 Control Hazards
Branch Hazards
 If branch outcome determined in MEM

(Figure: the three instructions fetched after the branch must be flushed, their control values set to 0, and the PC redirected)

Chapter 4 — The Processor — 83


Reducing Branch Delay
 Move hardware to determine outcome to ID
stage
 Target address adder
 Register comparator
 Example: branch taken
36: sub $10, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
...
72: lw $4, 50($7)

Chapter 4 — The Processor — 84


Example: Branch Taken

Chapter 4 — The Processor — 85


Example: Branch Taken

Chapter 4 — The Processor — 86


Data Hazards for Branches
 If a comparison register is a destination of
2nd or 3rd preceding ALU instruction

add $1, $2, $3 IF ID EX MEM WB

add $4, $5, $6 IF ID EX MEM WB

… IF ID EX MEM WB

beq $1, $4, target IF ID EX MEM WB

 Can resolve using forwarding

Chapter 4 — The Processor — 87


Data Hazards for Branches
 If a comparison register is a destination of
preceding ALU instruction or 2nd preceding
load instruction
 Need 1 stall cycle

lw $1, addr IF ID EX MEM WB

add $4, $5, $6 IF ID EX MEM WB

beq stalled IF ID

beq $1, $4, target ID EX MEM WB

Chapter 4 — The Processor — 88


Data Hazards for Branches
 If a comparison register is a destination of
immediately preceding load instruction
 Need 2 stall cycles

lw $1, addr IF ID EX MEM WB

beq stalled IF ID

beq stalled ID

beq $1, $0, target ID EX MEM WB

Chapter 4 — The Processor — 89


Dynamic Branch Prediction
 In deeper and superscalar pipelines, branch
penalty is more significant
 Use dynamic prediction
 Branch prediction buffer (aka branch history table)
 Indexed by recent branch instruction addresses
 Stores outcome (taken/not taken)
 To execute a branch
 Check table, expect the same outcome
 Start fetching from fall-through or target
 If wrong, flush pipeline and flip prediction

Chapter 4 — The Processor — 90


1-Bit Predictor: Shortcoming
 Inner loop branches mispredicted twice!
outer: …

inner: …

beq …, …, inner

beq …, …, outer

 Mispredict as taken on last iteration of


inner loop
 Then mispredict as not taken on first
iteration of inner loop next time around
Chapter 4 — The Processor — 91
2-Bit Predictor
 Only change prediction on two successive
mispredictions

Chapter 4 — The Processor — 92
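A C sketch of a 2-bit saturating-counter predictor of the kind described here; the table size and indexing scheme are illustrative assumptions:

```c
#include <stdint.h>

#define BHT_SIZE 1024                  /* assumed number of branch-history-table entries */
static uint8_t bht[BHT_SIZE];          /* 2-bit counters: 0,1 predict not taken; 2,3 predict taken */

/* Index by low-order word-address bits of the branch PC (illustrative choice) */
static unsigned bht_index(uint32_t pc) { return (pc >> 2) % BHT_SIZE; }

int predict_taken(uint32_t pc) {
    return bht[bht_index(pc)] >= 2;
}

/* Only two successive mispredictions flip the prediction, because the
   counter has to cross between the 0/1 and 2/3 halves of its range. */
void train(uint32_t pc, int taken) {
    uint8_t *c = &bht[bht_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}
```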


Calculating the Branch Target
 Even with predictor, still need to calculate
the target address
 1-cycle penalty for a taken branch
 Branch target buffer
 Cache of target addresses
 Indexed by PC when instruction fetched
 If hit and instruction is branch predicted taken, can
fetch target immediately

Chapter 4 — The Processor — 93


Chapter Five

©2004 Morgan Kaufmann Publishers 1


The Processor: Datapath & Control

• We're ready to look at an implementation of the MIPS


• Simplified to contain only:
– memory-reference instructions: lw, sw
– arithmetic-logical instructions: add, sub, and, or, slt
– control flow instructions: beq, j
• Generic Implementation:
– use the program counter (PC) to supply instruction address
– get the instruction from memory
– read registers
– use the instruction to decide exactly what to do
• All instructions use the ALU after reading the registers
Why? memory-reference? arithmetic? control flow?

©2004 Morgan Kaufmann Publishers 2


Overview of the Implementation

• For every instruction, the first two steps are identical:


1. Send the PC to the memory that contains the code and
fetch the instruction from that memory.
2. Read 1 or 2 registers, using fields of the instruction to
select the registers to read. For load word instruction we
need to read only 1 register, but most other instructions
require that we read 2 registers.
• After these two steps, the actions required to complete the
instruction depend on the instruction class (memory-reference,
arithmetic-logical, and branches).
• All instruction classes use the ALU after reading the registers:
– Memory-reference instructions use ALU for an address
calculation.
– Arithmetic-logical instructions use ALU for operation
execution.
– Branch instructions use ALU for comparison
©2004 Morgan Kaufmann Publishers 3
Overview of the Implementation

• After using ALU, the actions required to complete the


different instruction classes differ:
– A memory-reference instruction will need to
access the memory either to write data for a store
or read data for a load.
– An arithmetic-logical instruction must write the
data from the ALU back into a register.
– A branch instruction may need to change the next
instruction address based on the comparison.

©2004 Morgan Kaufmann Publishers 4


More Implementation Details

• Abstract / Simplified View:

(Figure: abstract view of the implementation: the PC addresses instruction memory, the fetched instruction supplies register numbers to the register file, the ALU operates on register data and computes addresses for data memory, and two adders update the PC)

Two types of functional units:


– elements that operate on data values (combinational)
– elements that contain state (sequential)

©2004 Morgan Kaufmann Publishers 5


Implementation Details of MIPS

• The functional units in MIPS implementation consist of


two different types of logic elements:
1. Elements that operate on data values:
– These elements are all combinational, which means
that their outputs depend only on the current
inputs.
– Given the same input, a combinational element
always produces the same output, because it has
no internal storage.
– The ALU is a combinational element.

©2004 Morgan Kaufmann Publishers 6


Implementation Details of MIPS
2. Elements that contain state:
– An element contains state if it has some internal storage.
– We call these elements state elements because, if we pulled the
plug on the machine, we could restart it by loading the state
elements with the values they contained before we pulled the plug.
– Logic components that contain state called sequential because
their outputs depend on both their inputs and the contents of the
internal state.
– The instruction and data memories as well as the registers are all
examples of state elements.
– A state element has at least two inputs and one output.
– The required inputs are the data value to be written into the
element, and the clock, which determines when the data value is
written.
– The output from a state element provides the value that was written
in an earlier clock cycle.

©2004 Morgan Kaufmann Publishers 7


State Elements

• Unclocked vs. Clocked


• Clocks used in synchronous logic
– when should an element that contains state be updated?

(Figure: clock waveform; the clock period (cycle time) spans one full cycle, with a rising edge and a falling edge)

©2004 Morgan Kaufmann Publishers 8


An unclocked state element

• The set-reset latch


– output depends on present inputs and also on past inputs

[Figure: set-reset (SR) latch with inputs R and S and outputs Q and its complement.]

©2004 Morgan Kaufmann Publishers 9


Latches and Flip-flops

• Output is equal to the stored value inside the element


(don't need to ask for permission to look at the value)
• Change of state (value) is based on the clock
• Latches: whenever the inputs change, and the clock is asserted
• Flip-flop: state changes only on a clock edge
(edge-triggered methodology)

"logically true",
— could mean electrically low

A clocking methodology defines when signals can be read and written


— wouldn't want to read a signal at the same time it was being written

©2004 Morgan Kaufmann Publishers 10


D-latch

• Two inputs:
– the data value to be stored (D)
– the clock signal (C) indicating when to read & store D
• Two outputs:
– the value of the internal state (Q) and its complement

[Figure: D latch. The clock input C gates the data input D into the latch; the outputs are Q and its complement.]

©2004 Morgan Kaufmann Publishers 11


Clocking Methodology

• A clocking methodology defines when signals can be read


and when they can be written.
• If a signal is written at the same time it is read, the value of
the read could correspond to the old value, the newly written
value, or even some mix of the two!
• Assume an edge-triggered clocking methodology, which
means that any values stored in the machine are updated
only on a clock edge.
• The state elements all update their internal storage on the
clock edge.
• Because only state elements can store a data value, any
collection of combinational logic must have its inputs coming
from a set of state elements and its outputs written into a set
of state elements.
• The inputs are values that were written in a previous clock
cycle, while the outputs are values that can be used in the
following clock cycle.

©2004 Morgan Kaufmann Publishers 12


D flip-flop

• Output changes only on the clock edge

[Figure: D flip-flop built from two D latches in series (master and slave), so the output Q changes only on a clock edge.]
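• To make the distinction concrete, here is a small Python sketch (illustrative only, not from the slides) modelling both behaviours:

    class DLatch:
        """Level-sensitive: output follows D whenever the clock C is asserted."""
        def __init__(self):
            self.q = 0
        def tick(self, c, d):
            if c:                         # transparent while the clock is asserted
                self.q = d
            return self.q

    class DFlipFlop:
        """Edge-triggered: state changes only on a rising clock edge."""
        def __init__(self):
            self.q = 0
            self.prev_c = 0
        def tick(self, c, d):
            if c and not self.prev_c:     # rising edge detected
                self.q = d
            self.prev_c = c
            return self.q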

©2004 Morgan Kaufmann Publishers 13


Our Implementation

• An edge triggered methodology


• Typical execution:
– read contents of some state elements,
– send values through some combinational logic
– write results to one or more state elements

[Figure: state element 1, a block of combinational logic, and state element 2, all traversed within one clock cycle.]

©2004 Morgan Kaufmann Publishers 14


Combinational logic, state elements, and clock are
closely related

• The figure in the previous slide shows two state elements
surrounding a block of combinational logic, which operates
in a single clock cycle:
– All signals must propagate from state element 1, through
the combinational logic, and to state element 2 in the time
of one clock cycle.
– The time for signals to reach state element 2 defines the
length of the clock cycle.
• Both the clock signal and the write control signal are inputs,
and the state element is changed only when the write control
signal is asserted and a clock edge occurs.

©2004 Morgan Kaufmann Publishers 15


Instruction Memory, Program Counter, and Adder

• First element we need is a place to store the


instructions of a program (a memory unit).

• To execute any instruction, we must start by fetching


the instruction from memory.

• To prepare for executing the next instruction, we


must increment the program counter (PC), so that it
points at the next instruction, 4 bytes later.

• Therefore, two state elements are needed to store
and access instructions, and an adder is needed to
compute the next instruction address.

©2004 Morgan Kaufmann Publishers 16


A portion of the datapath used for fetching
instructions and incrementing the PC

• The PC is a 32-bit register that will be written at the end of every


clock cycle.
• The adder is an ALU wired to always perform an add of its two
32-bit inputs and place the result on its output.

[Figure: instruction fetch. The PC supplies the read address to the instruction memory, which outputs the instruction; an adder computes PC + 4.]

©2004 Morgan Kaufmann Publishers 17


Register File

• R-format instructions (add, sub, slt, and, or) all


read two registers, perform an ALU operation on
the contents of the registers, and write the result.
• The processor’s 32 registers are stored in a
structure called a register file.
• A register file is a collection of registers in which
any register can be read or written by specifying
the number of the register in the file.
• Since R-format instructions have 3 register operands, we
will need to read two data words from the register
file and write one data word into the register file for
each instruction.

©2004 Morgan Kaufmann Publishers 18


Register File

• For each data word to be read from the registers:


– We need an input to the register file that
specifies the register number to be read
– We need an output from the register file that will
carry the value that has been read from the
registers.
• To write a data word, we need two inputs:
– One to specify the register number to be written
– One to supply the data to be written into the
register.
• We need a total of 4 inputs (3 for register numbers
and 1 for data) and 2 outputs (both for data) as
shown in the next slide.
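• A behavioural sketch of such a register file in Python (illustrative only; the class and method names are ours, not the book's):

    class RegisterFile:
        def __init__(self, n=32):
            self.regs = [0] * n
        def read(self, rnum1, rnum2):
            # Two read ports: register numbers in, data values out.
            return self.regs[rnum1], self.regs[rnum2]
        def write(self, wnum, wdata, reg_write):
            # One write port, used only when RegWrite is asserted
            # (in hardware, on the clock edge); register 0 stays 0 in MIPS.
            if reg_write and wnum != 0:
                self.regs[wnum] = wdata & 0xFFFFFFFF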
©2004 Morgan Kaufmann Publishers 19
Register File

• Built using D flip-flops

[Figure: register file read ports. Each read register number drives a multiplexor that selects the corresponding register's contents onto Read data 1 or Read data 2.]

Do you understand? What is the “Mux” above?

©2004 Morgan Kaufmann Publishers 20


Register File

• Note: we still use the real clock to determine when to write

[Figure: register file write port. The write register number drives an n-to-2^n decoder; the selected decoder output, combined with the Write signal, clocks the register data (D input) into the chosen register.]

©2004 Morgan Kaufmann Publishers 21


Register File and ALU
• The register number inputs are 5 bits wide to
specify one of the 32 registers (32 = 2^5),
whereas the data input and the two output
buses are each 32 bits wide.

• ALU is controlled by 4-bit signal and it takes


two 32-bit inputs and produces a 32-bit result
as shown in the next slide.

©2004 Morgan Kaufmann Publishers 22


Register File and ALU

[Figure: (a) the register file, with two 5-bit read register numbers, a 5-bit write register number, write data, two 32-bit read data outputs, and the RegWrite control; (b) the ALU, with two 32-bit data inputs, a 4-bit ALU control input, the 32-bit ALU result, and a Zero output.]

©2004 Morgan Kaufmann Publishers 23


The datapath for R-type instructions

• The two elements needed to implement R-format ALU operations are
the register file and the ALU
[Figure: R-type datapath. Instruction fields supply the two read register numbers and the write register number; the two read data outputs feed the ALU (4-bit ALU control, Zero output), and the ALU result is written back to the register file when RegWrite is asserted.]

©2004 Morgan Kaufmann Publishers 24


Simple Implementation

• Include the functional units we need for each instruction

[Figure: the individual functional units: (a) instruction memory, (b) program counter, (c) adder; (a) data memory unit with MemRead and MemWrite, (b) sign-extension unit (16 to 32 bits); (a) registers with RegWrite, (b) ALU with its 4-bit operation input and Zero output.]

Why do we need this stuff?

©2004 Morgan Kaufmann Publishers 25


Building the Datapath

• Use multiplexors to stitch them together

[Figure: the datapath assembled: PC, instruction memory, register file, sign-extension unit, ALU, branch adder with shift-left-2, and data memory, stitched together with multiplexors controlled by PCSrc, ALUSrc, MemtoReg, RegWrite, MemWrite, and MemRead.]

©2004 Morgan Kaufmann Publishers 26


MIPS load word and store word instructions

• Load word instruction: lw $t1, offset_value($t2)


• Store word instruction: sw $t1, offset_value($t2)

• These instructions compute a memory address by
adding the base register ($t2) to the 16-bit
signed offset field contained in the instruction.
• If the instruction is a store, the value to be stored
must be read from the register file where it resides
in $t1.
• If the instruction is a load, the value read from
memory must be written into the register file in the
specified register, which is $t1.

©2004 Morgan Kaufmann Publishers 27


To implement a datapath for MIPS load and store instructions

• Therefore we will need the following:


– Both the register file and ALU.
– A unit to sign-extend the 16-bit offset field in the
instruction to a 32-bit signed value.
– A data memory unit to read from or write to.

• Because the data memory must be written on store
instructions and read on loads, it has both read and write
control signals, an address input, and an input for the data
to be written into memory (see the next slide).
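• The sign-extension step itself is straightforward; a Python illustration (the function name is ours, not the book's):

    def sign_extend_16_to_32(offset16):
        """Extend a 16-bit two's-complement value to 32 bits."""
        offset16 &= 0xFFFF                # keep only the low 16 bits
        if offset16 & 0x8000:             # sign bit set, so the value is negative
            return offset16 | 0xFFFF0000  # replicate the sign bit upward
        return offset16

    # e.g. sign_extend_16_to_32(0xFFFC) == 0xFFFFFFFC  (-4 as a 32-bit pattern)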

©2004 Morgan Kaufmann Publishers 28


Data Memory and Sign-extension Units
[Figure: (a) the data memory unit, with Address, Write data, Read data, MemRead, and MemWrite; (b) the sign-extension unit, which extends a 16-bit input to 32 bits.]

©2004 Morgan Kaufmann Publishers 29


Datapath for MIPS load and store instructions

• Assume the instruction has already been fetched.

• The register number inputs for the register file


come from fields of the instruction, as does the
offset value, which after sign extension becomes
the second ALU input.

• The figure in the next slide shows the datapath: for
a load or a store, a register access is followed
by a memory address calculation, then a read or
write from memory, and a write into the register file
if the instruction is a load.

©2004 Morgan Kaufmann Publishers 30


Datapath for MIPS load and store instructions
[Figure: load/store datapath. Register numbers come from instruction fields; the sign-extended 16-bit offset is the second ALU input; the ALU result addresses the data memory; MemRead, MemWrite, and RegWrite control the accesses.]

©2004 Morgan Kaufmann Publishers 31


Branch if equal (beq) instruction

• The beq instruction has 3 operands: 2 registers that are
compared for equality, and a 16-bit offset used to
compute the branch target address relative to the
branch instruction address.
• Form: beq $t1, $t2, offset
• To implement this instruction, we must compute
the branch target address by adding the sign-
extended offset field of the instruction to the PC.

©2004 Morgan Kaufmann Publishers 32


beq Instruction

• There are two details in the definition of the branch


instruction (for details see chapter 3):

1. The instruction set architecture specifies that the base


for the branch address calculation is the address of the
instruction following the branch.
Since we compute PC+4 (address of next instruction), it
is easy to use this value as the base for computing the
branch target address.

2. The architecture states that the offset field is shifted
left 2 bits so that it is a word offset; this shift increases
the effective range of the offset field by a factor of four.
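• Putting the two details together, the branch target calculation can be sketched in Python (illustrative only; it reuses the sign_extend_16_to_32 helper sketched a few slides back):

    def branch_target(pc, offset16):
        # Base is PC + 4 (the address of the instruction after the branch);
        # the sign-extended 16-bit offset is shifted left 2 (a word offset).
        return ((pc + 4) + (sign_extend_16_to_32(offset16) << 2)) & 0xFFFFFFFF

    # e.g. with pc = 0x1000 and offset16 = 3: target = 0x1004 + 12 = 0x1010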

©2004 Morgan Kaufmann Publishers 33


beq Instruction (Cont.)

• If the two operands are equal (condition is true)


then the branch target address becomes the new
PC, and we say the branch is taken.
• If the two operands are not equal (condition is not
true) then the incremented PC should replace the
current PC, and we say the branch is not taken.
• The branch datapath must do two operations:
compute the branch target address and compare
the register contents.

©2004 Morgan Kaufmann Publishers 34


A Datapath for MIPS beq Instruction

[Figure: beq datapath. PC + 4 from the instruction datapath is added to the sign-extended offset, shifted left 2, to form the branch target; the register file supplies the two operands to the ALU, whose Zero output goes to the branch control logic.]

©2004 Morgan Kaufmann Publishers 35


A Datapath for MIPS beq Instruction
• To compute the branch target address, the branch
datapath includes a sign extension unit and an adder.
• To perform the compare, we need to use the register file to
supply the two register operands.
• Also, because the ALU provides an output signal that indicates
whether the result was 0, we can send the two register
operands to the ALU with the control set to do a subtract.
– If the Zero signal out of the ALU unit is asserted, then
the two values are equal.
• Since the offset was sign-extended from 16 bits, the shift
will throw away only “sign bits”.
• Control logic is used to decide whether the incremented
PC or branch target should replace the PC, based on the
Zero output of the ALU.

©2004 Morgan Kaufmann Publishers 36


Combining datapaths for memory and R-type instructions
using multiplexors
[Figure: the load/store and R-type datapaths combined with multiplexors. ALUSrc selects the second ALU input (Read data 2 for R-type, the sign-extended offset for load/store); MemtoReg selects the value written back to the register file (the ALU result for R-type, the memory read data for a load).]
©2004 Morgan Kaufmann Publishers 37
Combining datapaths for memory instructions, R-type
instructions, and instruction fetch

[Figure: the combined datapath with instruction fetch added. The PC addresses the instruction memory and an adder computes PC + 4, while the instruction fields drive the register file, ALU, and data memory as on the previous slide.]
©2004 Morgan Kaufmann Publishers 38
Combining datapaths for memory instructions, R-type
instructions, and instruction fetch

• The combined datapath in the previous slide
requires both an adder and an ALU, since the adder
is used to increment the PC while the ALU is
used for executing the instruction in the same
clock cycle.

• In the next slide we build a simple datapath for the


MIPS architecture, which can execute the basic
instructions (load / store word, ALU operations,
and branches) in a single clock cycle.

©2004 Morgan Kaufmann Publishers 39


Building the Datapath for MIPS Architecture

• Use multiplexors to stitch them together


[Figure: the simple single-cycle datapath for the MIPS architecture: instruction fetch (PC, instruction memory, PC + 4 adder), register file and ALU, data memory, sign-extension and shift-left-2 units, the branch-target adder, and multiplexors controlled by PCSrc, ALUSrc, MemtoReg, RegWrite, MemWrite, and MemRead.]
©2004 Morgan Kaufmann Publishers 40


Building the Datapath for MIPS Architecture

• From the previous slide, the branch instruction


uses the main ALU for comparison of the register
operands, so the adder is used for computing the
branch target address.

• Also, an additional multiplexor is required to select


either the sequentially following instruction
address (PC + 4) or the branch target address to be
written into the PC.

©2004 Morgan Kaufmann Publishers 41


Control

• Selecting the operations to perform (ALU, read/write, etc.)


• Controlling the flow of data (multiplexor inputs)
• Information comes from the 32 bits of the instruction
• Example:

add $8, $17, $18        Instruction Format:

    000000   10001   10010   01000   00000   100000
      op       rs      rt      rd    shamt    funct

• ALU's operation based on instruction type and function code
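• For illustration, the fields of an R-format word can be pulled out with shifts and masks in Python (the function name is ours); the 32-bit value in the comment is the add $8, $17, $18 encoding shown above:

    def decode_r_format(word):
        """Split a 32-bit MIPS R-format instruction into its six fields."""
        return {
            "op":    (word >> 26) & 0x3F,
            "rs":    (word >> 21) & 0x1F,
            "rt":    (word >> 16) & 0x1F,
            "rd":    (word >> 11) & 0x1F,
            "shamt": (word >>  6) & 0x1F,
            "funct":  word        & 0x3F,
        }

    # decode_r_format(0x02324020) ->
    # {'op': 0, 'rs': 17, 'rt': 18, 'rd': 8, 'shamt': 0, 'funct': 32}   # funct 32 = 0x20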

©2004 Morgan Kaufmann Publishers 42


Control

• e.g., what should the ALU do with this instruction


• Example: lw $1, 100($2)

     35       2       1         100
     op       rs      rt    16-bit offset

• ALU control input

0000 AND
0001 OR
0010 add
0110 subtract
0111 set-on-less-than
1100 NOR

• Why is the code for subtract 0110 and not 0011?
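• A software analogue of the ALU for the control codes listed above (Python, illustrative only):

    MASK32 = 0xFFFFFFFF

    def to_signed(x):
        x &= MASK32
        return x - (1 << 32) if x & 0x80000000 else x

    def alu(control, a, b):
        """Behavioural sketch of the ALU; returns (result, Zero)."""
        if   control == 0b0000: result = a & b                              # AND
        elif control == 0b0001: result = a | b                              # OR
        elif control == 0b0010: result = (a + b) & MASK32                   # add
        elif control == 0b0110: result = (a - b) & MASK32                   # subtract
        elif control == 0b0111: result = int(to_signed(a) < to_signed(b))   # set-on-less-than
        elif control == 0b1100: result = ~(a | b) & MASK32                  # NOR
        else: raise ValueError("unused ALU control code")
        return result, result == 0            # the Zero output used by beq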


©2004 Morgan Kaufmann Publishers 43
Control

• Must describe hardware to compute the 4-bit ALU control input
  – given the instruction type (the 2-bit ALUOp, computed from the instruction type):
        00 = lw, sw
        01 = beq
        10 = arithmetic
  – and the function code, for arithmetic instructions

• Describe it using a truth table (can turn into gates):

©2004 Morgan Kaufmann Publishers 44


[Figure: the simple datapath with the control unit added. The opcode field (Instruction [31–26]) drives the Control unit, which generates RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; the function field (Instruction [5–0]) and ALUOp feed the ALU control; instruction fields [25–21], [20–16], [15–11], and [15–0] supply the register numbers and the offset.]

Instruction   RegDst   ALUSrc   MemtoReg   RegWrite   MemRead   MemWrite   Branch   ALUOp1   ALUOp0
R-format         1        0         0          1          0         0         0        1        0
lw               0        1         1          1          1         0         0        0        0
sw               X        1         X          0          0         1         0        0        0
beq              X        0         X          0          0         0         1        0        1
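• In software form this table is just a lookup from instruction class to control values (an illustrative Python dictionary; None stands for a don't-care X):

    MAIN_CONTROL = {
        "R-format": dict(RegDst=1,    ALUSrc=0, MemtoReg=0,    RegWrite=1,
                         MemRead=0,   MemWrite=0, Branch=0,    ALUOp=0b10),
        "lw":       dict(RegDst=0,    ALUSrc=1, MemtoReg=1,    RegWrite=1,
                         MemRead=1,   MemWrite=0, Branch=0,    ALUOp=0b00),
        "sw":       dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                         MemRead=0,   MemWrite=1, Branch=0,    ALUOp=0b00),
        "beq":      dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                         MemRead=0,   MemWrite=0, Branch=1,    ALUOp=0b01),
    }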
Control

• Simple combinational logic (truth tables)

[Figure: the control function as a truth table: inputs Op5–Op0 (the opcode bits), one column per instruction class (R-format, lw, sw, beq), and one row per output signal (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0).]

©2004 Morgan Kaufmann Publishers 46


Single Cycle Implementation

• Calculate cycle time assuming negligible delays except:


– memory (200ps)
– ALU and adders (100ps)
– register file access (50ps)
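• A back-of-the-envelope version of the calculation (Python, illustrative; it uses only the delays above and treats everything else as negligible):

    MEM, ALU, REGFILE = 200, 100, 50      # assumed delays in ps

    paths = {
        # fetch + register read + ALU + (data memory) + (register write)
        "R-format": MEM + REGFILE + ALU + 0   + REGFILE,   # 400 ps
        "lw":       MEM + REGFILE + ALU + MEM + REGFILE,   # 600 ps
        "sw":       MEM + REGFILE + ALU + MEM,             # 550 ps
        "beq":      MEM + REGFILE + ALU,                   # 350 ps
    }

    # A single-cycle clock must fit the slowest instruction:
    cycle_time = max(paths.values())      # 600 ps, set by lw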
[Figure: the single-cycle datapath again, used to trace each instruction class's path through instruction memory, register file, ALU, and data memory when calculating the cycle time.]

©2004 Morgan Kaufmann Publishers 47


Where we are headed

• Single Cycle Problems:


– what if we had a more complicated instruction like floating
point?
– wasteful of area
• One Solution:
– use a “smaller” cycle time
– have different instructions take different numbers of cycles
– a “multicycle” datapath:

[Figure: high-level view of the multicycle datapath: a single memory (for instructions or data), the Instruction register and Memory data register, the register file, the A and B registers, and a single ALU with its ALUOut register.]

©2004 Morgan Kaufmann Publishers 48


Multicycle Approach

• We will be reusing functional units


– ALU used to compute address and to increment PC
– Memory used for instruction and data
• Our control signals will not be determined directly by the instruction
– e.g., what should the ALU do for a “subtract” instruction?
• We’ll use a finite state machine for control

©2004 Morgan Kaufmann Publishers 49


Multicycle Approach

• Break up the instructions into steps, each step takes a cycle


– balance the amount of work to be done
– restrict each cycle to use only one major functional unit
• At the end of a cycle
– store values for use in later cycles (easiest thing to do)
– introduce additional “internal” registers

[Figure: the multicycle datapath. Multiplexors select the memory address (PC or ALUOut), the register write data, and the ALU inputs; the Instruction register, Memory data register, A, B, and ALUOut registers hold values between cycles; sign-extend and shift-left-2 units prepare the offset.]

©2004 Morgan Kaufmann Publishers 50


Instructions from ISA perspective

• Consider each instruction from perspective of ISA.


• Example:
– The add instruction changes a register.
– Register specified by bits 15:11 of instruction.
– Instruction specified by the PC.
– New value is the sum (“op”) of two registers.
– Registers specified by bits 25:21 and 20:16 of the instruction

Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]

– In order to accomplish this we must break up the instruction.


(kind of like introducing variables when programming)

©2004 Morgan Kaufmann Publishers 51


Breaking down an instruction

• ISA definition of arithmetic:

Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]

• Could break down to:


– IR <= Memory[PC]
– A <= Reg[IR[25:21]]
– B <= Reg[IR[20:16]]
– ALUOut <= A op B
– Reg[IR[15:11]] <= ALUOut

• We forgot an important part of the definition of arithmetic!


– PC <= PC + 4
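• A software rendering of this breakdown, with the PC update included (Python, illustrative only; IR, A, B, and ALUOut model the internal registers, and alu is any function mapping (funct, A, B) to a result):

    def multicycle_r_type(state, memory, regs, alu):
        # Step 1: instruction fetch
        state["IR"] = memory[state["PC"]]
        state["PC"] += 4                          # the part we must not forget
        ir = state["IR"]
        # Step 2: decode / register fetch
        state["A"] = regs[(ir >> 21) & 0x1F]      # rs = IR[25:21]
        state["B"] = regs[(ir >> 16) & 0x1F]      # rt = IR[20:16]
        # Step 3: execute
        state["ALUOut"] = alu(ir & 0x3F, state["A"], state["B"])
        # Step 4: R-type completion, write back to rd = IR[15:11]
        regs[(ir >> 11) & 0x1F] = state["ALUOut"]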

©2004 Morgan Kaufmann Publishers 52


Idea behind multicycle approach

• We define each instruction from the ISA perspective (do this!)

• Break it down into steps following our rule that data flows through at
most one major functional unit (e.g., balance work across steps)

• Introduce new registers as needed (e.g., A, B, ALUOut, MDR, etc.)

• Finally, try to pack as much work as possible into each step
(avoid unnecessary cycles)
while also trying to share steps where possible
(minimizes control, helps to simplify the solution)

• Result: Our book’s multicycle Implementation!

©2004 Morgan Kaufmann Publishers 53


Five Execution Steps

• Instruction Fetch

• Instruction Decode and Register Fetch

• Execution, Memory Address Computation, or Branch Completion

• Memory Access or R-type instruction completion

• Write-back step

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

©2004 Morgan Kaufmann Publishers 54


Step 1: Instruction Fetch

• Use PC to get instruction and put it in the Instruction Register.


• Increment the PC by 4 and put the result back in the PC.
• Can be described succinctly using RTL "Register-Transfer Language"

IR <= Memory[PC];
PC <= PC + 4;

Can we figure out the values of the control signals?

What is the advantage of updating the PC now?

©2004 Morgan Kaufmann Publishers 55


Step 2: Instruction Decode and Register Fetch

• Read registers rs and rt in case we need them


• Compute the branch address in case the instruction is a branch
• RTL:

A <= Reg[IR[25:21]];
B <= Reg[IR[20:16]];
ALUOut <= PC + (sign-extend(IR[15:0]) << 2);

• We aren't setting any control lines based on the instruction type


(we are busy "decoding" it in our control logic)

©2004 Morgan Kaufmann Publishers 56


Step 3 (instruction dependent)

• ALU is performing one of three functions, based on instruction type

• Memory Reference:

ALUOut <= A + sign-extend(IR[15:0]);

• R-type:

ALUOut <= A op B;

• Branch:

if (A==B) PC <= ALUOut;

©2004 Morgan Kaufmann Publishers 57


Step 4 (R-type or memory-access)

• Loads and stores access memory

MDR <= Memory[ALUOut];


or
Memory[ALUOut] <= B;

• R-type instructions finish

Reg[IR[15:11]] <= ALUOut;

The write actually takes place at the end of the cycle on the edge

©2004 Morgan Kaufmann Publishers 58


Write-back step

• Reg[IR[20:16]] <= MDR;

Which instruction needs this?

©2004 Morgan Kaufmann Publishers 59


Summary:

©2004 Morgan Kaufmann Publishers 60


Simple Questions

• How many cycles will it take to execute this code?

lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label #assume not
add $t5, $t2, $t3
sw $t5, 8($t3)
Label: ...

• What is going on during the 8th cycle of execution?


• In what cycle does the actual addition of $t2 and $t3 take place?
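• To sanity-check the first question, a tiny Python sketch using the 3 to 5 cycle counts from the multicycle approach (lw = 5, sw = 4, R-type = 4, beq = 3); the total is offered as an illustration, not as the official solution:

    CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3}

    program = ["lw", "lw", "beq", "add", "sw"]    # the sequence above, branch not taken
    total = sum(CYCLES[i] for i in program)       # 5 + 5 + 3 + 4 + 4 = 21 cycles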

©2004 Morgan Kaufmann Publishers 61
