A pipelined implementation
[Figure: the pipelined datapath: the single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, and the associated adders and MUXes) divided into the IF, ID, EX, MEM, and WB stages by pipeline registers.]
It is useful to note the changes that have been made to the datapath.
The most obvious change is, of course, the addition of the pipeline registers.
The addition of these registers introduces some questions:
How large should the pipeline registers be?
Will they be the same size in each stage?
The next change is the location of the MUX that updates the PC.
This must be associated with the IF stage, where the PC is also incremented.
The third change is to preserve the address of the register to be written in the register file. This is done by passing the address along the pipeline registers until it is required in the WB stage.
The write address supplied by the MUX now comes from the pipeline register, rather than directly from the instruction.
Pipeline control
Since five instructions are now executing simultaneously, the controller for the pipelined implementation is, in general, more complex.
It is not as complex as it appears at first glance, however.
For a processor like the MIPS, it is possible to decode the instruction
in the early pipeline stages, and to pass the control signals along the
pipeline in the same way as the data elements are passed through
the pipeline.
(This is what will be done in our implementation.)
A variant of this would be to pass the instruction field (or parts of
it) and to decode the instruction as needed for each stage.
For our processor example, since the datapath elements are the same as for the single cycle processor, the control signals required are similar, and can be implemented in a similar way.
All the signals can be generated early (in the ID stage) and passed
along the pipeline until they are required.
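As a rough sketch of this idea in C (all names are invented for illustration, not any particular implementation), the pipeline registers can be modelled as structures that carry the decoded control bits forward together with the data:

/* A sketch of pipeline registers carrying control signals forward.
   All field names are illustrative only. */
#include <stdint.h>

typedef struct {             /* control bits decoded in the ID stage */
    unsigned reg_dst    : 1; /* EX-stage controls */
    unsigned alu_src    : 1;
    unsigned alu_op     : 2;
    unsigned branch     : 1; /* MEM-stage controls */
    unsigned mem_read   : 1;
    unsigned mem_write  : 1;
    unsigned reg_write  : 1; /* WB-stage controls */
    unsigned mem_to_reg : 1;
} Control;

typedef struct {             /* ID/EX pipeline register */
    Control  ctrl;           /* all remaining control bits travel here */
    uint32_t read_data1, read_data2, sign_ext_imm;
    uint8_t  rt, rd;         /* candidate write-register addresses */
} ID_EX;

typedef struct {             /* EX/MEM: the EX controls are no longer needed */
    Control  ctrl;
    uint32_t alu_result, write_data, branch_target;
    uint8_t  write_reg;      /* the chosen write-register address */
} EX_MEM;

/* On each clock edge the control bits are simply copied along,
       ex_mem.ctrl = id_ex.ctrl;
   exactly as the data fields are. */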
[Figure: the pipelined datapath with control: the control signals (PCSrc, RegDst, ALUSrc, ALUop, Branch, MemRead, MemWrite, RegWrite, MemtoReg) are generated from Inst[31-26] in the ID stage and carried through the EX, MEM, and WB fields of the pipeline registers until they are used.]
Executing an instruction
In the following figures, we will follow the execution of an instruction
through the pipeline.
The instructions we have implemented in the datapath are those of
the simplest version of the single cycle processor, namely:
the R-type instructions
load
store
beq
We will follow the load instruction, as an example.
[Figures: the load instruction followed through the pipeline, one clock cycle per figure, through the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers: instruction fetch, register read and decode, ALU address calculation, memory read, and register write back. The active portion of the datapath is highlighted in each figure.]
[Figures: multiple-clock-cycle pipeline diagrams of instruction sequences (lw, sw, add, sub, beq, and), each instruction occupying the IM, REG, ALU, DM, and REG stages in successive cycles.]
Pipeline hazards
There are three types of hazards in pipelined implementations: structural hazards, control hazards, and data hazards.
Structural hazards
Structural hazards occur when there are insufficient hardware resources to support the particular combination of instructions presently
being executed.
The present implementation has a potential structural hazard if there
is a single memory for data and instructions.
Other structural hazards cannot happen in a simple linear pipeline,
but for more complex pipelines they may occur.
Control hazards
These hazards happen when the flow of control changes as a result
of some computation in the pipeline.
One question here is: what happens to the rest of the instructions in the pipeline?
Consider the beq instruction.
The branch address calculation and the comparison are performed in the EX cycle, and the branch address is returned to the PC in the next cycle.
[Figure: pipeline diagram showing the instructions following a branch (an add and a lw) stalled until the branch is resolved.]
It can also be done by the compiler, by placing several nop instructions following a branch. (It is not called a pipeline stall then.)
[Figures: pipeline diagrams of a beq followed by three nops before the next instructions (add, lw), and of the instruction at the branch target entering the pipeline only after the branch is resolved.]
We saw earlier that branches are quite common, so inserting many stalls or nops is inefficient. An alternative is to have the compiler fill the delay slots after a branch with useful instructions.
For long pipelines, however, it is difficult to find useful instructions to fill several branch delay slots, so this idea is not used in most modern processors.
Branch prediction
If branches could be predicted, there would be no need for stalls.
Most modern processors do some form of branch prediction.
Perhaps the simplest is to predict that no branch will be taken.
In this case, the pipeline is flushed if the branch prediction is wrong,
and none of the results of the instructions in the pipeline are written
to the register file.
How effective is this prediction method?
What branches are most common?
Consider the most common control structure in most programs: the loop.
In this structure, the most common result of a branch is that it is taken; consequently the next instruction in memory is a poor prediction. In fact, in a loop, the branch is not taken exactly once, at the end of the loop.
A better choice may be to record the last branch decision (or the last few decisions) and make a prediction based on the branch history.
Branches are problematic in that they are frequent, and cause inefficiencies by requiring pipeline flushes; in deep pipelines, these flushes are expensive.
21
Data hazards
Another common pipeline hazard is a data hazard. Consider the
following instructions:
add $r2, $r1, $r3
add $r5, $r2, $r3
Note that $r2 is written in the first instruction, and read in the
second.
In our pipelined implementation, however, $r2 is not written until
four cycles after the second instruction begins, and therefore three
bubbles or nops would have to be inserted before the correct value
would be read.
[Figure: pipeline diagram of the two add instructions, separated by three nops so that $r2 is written back to the register file before it is read.]
[Figures: multi-cycle pipeline diagrams (including a sw $7, 100($2)) with operands forwarded from the ALU and memory pipeline registers to dependent instructions.]
Note how forwarding eliminates the data hazards in these cases.
Implementing forwarding
Note from the previous examples that there are now two potential additional sources of operands for the ALU during the EX cycle: the EX/MEM pipeline register, and the MEM/WB pipeline register.
What additional hardware would be required to provide the data from the pipeline stages?
The data to be forwarded could be required by either of the inputs to the ALU, so two MUXes would be required: one for each ALU input.
Each MUX would have three sources of data: the original data from the register file (in pipeline register ID/EX), or the two pipeline registers to be forwarded from.
Looking only at the datapath for R-type operations, the additional hardware would be as follows:
[Figure: the EX-stage datapath for R-type operations with forwarding: the ForwardA and ForwardB MUXes select each ALU input from the ID/EX register data, the EX/MEM ALU result, or the MEM/WB result.]
Forwarding control
Under what conditions does a data hazard (for R-type operations) occur?
It is when a register to be read in the EX cycle is the same register as one targeted to be written, whose value is still held in either the EX/MEM pipeline register or the MEM/WB pipeline register.
These conditions can be expressed as:
1. EX/MEM.RegisterRd = ID/EX.RegisterRs or ID/EX.RegisterRt
2. MEM/WB.RegisterRd = ID/EX.RegisterRs or ID/EX.RegisterRt
Some instructions do not write registers, so the forwarding unit should check whether the register actually will be written. (If it is to be written, the control signal RegWrite, also carried in the pipeline, will be set.)
Also, an instruction may try to write some value to register 0. More importantly, it may try to write a non-zero value there, which should not be forwarded: register 0 is always zero.
Therefore, register 0 should never be forwarded.
MUX control       Source    Explanation
ForwardA/B = 00   ID/EX     Operand comes from the register file (no forwarding)
ForwardA/B = 10   EX/MEM    Operand is forwarded from the previous ALU result
ForwardA/B = 01   MEM/WB    Operand is forwarded from data memory or an earlier result
The conditions for a hazard with a value in the EX/MEM stage are:
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
then ForwardA = 10
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
then ForwardB = 10
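The same conditions can be written out as a small C function. This is only a sketch (the names are invented, and the rule that a match in EX/MEM takes precedence over one in MEM/WB is the usual refinement, not spelled out above):

#include <stdint.h>

typedef struct {
    int     reg_write;   /* RegWrite control bit, carried in the pipeline */
    uint8_t rd;          /* destination register number */
} FwdSrc;

/* Returns the 2-bit MUX control for one ALU input;
   reg is the source register (rs or rt) held in ID/EX. */
static unsigned forward(FwdSrc ex_mem, FwdSrc mem_wb, uint8_t reg)
{
    /* Forward from EX/MEM first: the most recent result wins. */
    if (ex_mem.reg_write && ex_mem.rd != 0 && ex_mem.rd == reg)
        return 2;                       /* binary 10 */
    /* Forward from MEM/WB only if EX/MEM did not already match. */
    if (mem_wb.reg_write && mem_wb.rd != 0 && mem_wb.rd == reg)
        return 1;                       /* binary 01 */
    return 0;                           /* binary 00: register file value */
}
/* ForwardA = forward(ex_mem, mem_wb, id_ex_rs);
   ForwardB = forward(ex_mem, mem_wb, id_ex_rt); */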
The datapath with the forwarding control is shown in the next figure.
[Figure: the datapath with forwarding control: a forwarding unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against the rs and rt fields held in ID/EX, and drives the ForwardA and ForwardB MUXes at the ALU inputs.]
For a datapath with forwarding, the hazards which are fixed by forwarding are not considered hazards any more.
[Figure: multi-cycle pipeline diagram for the sequence lw $2, 100($3); sw $2, 400($3); add $4, $3, $2.]
Here, the data from the load is not ready when the R-type instruction requires it: we have a hazard.
What can be done here?
[Figures: pipeline diagrams showing the load-use hazard resolved by a one-cycle stall, so that the value loaded by lw $2, 100($3) can be forwarded to the dependent add $4, $3, $2.]
In the MIPS processor, however, the branch instructions were implemented to require only two cycles. The instruction following the branch was always executed. (The compiler attempted to place a useful instruction in this branch delay slot, but if it could not, a nop was placed there.)
The original MIPS did not have forwarding, but it is useful to consider the kinds of hazards which could arise with this instruction.
Consider the sequence
add $2, $3, $4
beq $2, $5, 25
[Figure: pipeline diagram of the add followed by the dependent beq.]
[Figure: the pipelined datapath extended with a hazard detection unit, forwarding unit, and exception support: ID.Flush and EX.Flush can turn instructions into nops, the Cause and Except PC registers record the exception, and the exception handler address 40000040 (hex) can be selected into the PC.]
[Figure: a dynamically scheduled pipeline: an instruction fetch and decode unit issues instructions in order to reservation stations; the functional units (two integer units, floating point, load/store) execute out of order; and a commit unit retires results in order.]
Dynamic pipeline scheduling is used in the three most popular processor families in machines today: the Pentium II, III, and IV, the AMD Athlon, and the PowerPC.
[Figure: a modern dynamically scheduled pipeline: the PC, instruction cache, and branch prediction feed an instruction queue; a decode/dispatch unit and register file issue to reservation stations for branch, integer, complex integer, floating point, and load/store units; a commit unit with a reorder buffer retires results in order, with load and store paths to the data cache.]
Speculative execution
      add   $s0, $0, $0      # initialize i to 0
Loop: lw    $t0, 0($s3)
      lw    $t1, 0($s4)
      addi  $s0, $s0, 1      # increment i
      mul   $t0, $t0, $s2    # $t0 = A * X[i]   (operands assumed; A in $s2)
      add   $t1, $t1, $t0    # $t1 = A * X[i] + Y[i]
      sw    $t1, 0($s4)
      addi  $s3, $s3, 4      # advance the X and Y pointers (assumed)
      addi  $s4, $s4, 4
      bne   $s0, $s1, Loop   # loop until i reaches N   ($s1 = N assumed)
This is a fairly direct implementation of the loop, and is not the most efficient code.
For example, the variable i need not be implemented in this code; we could use the array index for one of the vectors instead, and use the final array address (+4) as the termination condition.
Also, this code has numerous data dependencies, some of which may be reduced by reordering the code.
Using this idea, register $s1 would now be set by the compiler to the value of the start of array X (or Y), plus 4(N + 1).
Reordering, and rescheduling the previous code for the MIPS:
Loop: lw    $t0, 0($s3)
      lw    $t1, 0($s4)
      addi  $s3, $s3, 4      # advance the pointers (assumed, as before)
      addi  $s4, $s4, 4
      mul   $t0, $t0, $s2    # $t0 = A * X[i]
      nop                    # wait for the dependency
      nop                    # on $t0
      nop
      add   $t1, $t1, $t0    # as above
      nop
      nop
      bne   $s4, $s1, Loop   # $s1 holds the termination address (assumed)
      sw    $t1, -4($s4)     # in the branch delay slot
Note that variable i is no longer used, the code is somewhat reordered, nop instructions are added to preserve the correct execution
of the code, and the single branch delay slot is used. The dependencies causing hazards relate to registers $t0 and $t1. (This still may
not be the most efficient schedule.)
A total of 5 nop instructions are used, out of 13 instructions.
This is not very efficient!
Loop unrolling
Suppose we rewrite this code to correspond to the following C program:
for (i = 0; i < N; i+=2)
{
Y[i] = A * X[i] + Y[i];
Y[i+1] = A * X[i+1] + Y[i+1];
}
As long as the number of iterations is a multiple of 2, this is equivalent.
This loop is said to be unrolled once. Each iteration of this loop
does the same computation as two of the previous iterations.
A loop can be unrolled multiple times.
Does this save any execution time? If so, how?
Loop: lw    $t0, 0($s3)
      lw    $t2, 4($s3)
      lw    $t1, 0($s4)
      lw    $t3, 4($s4)
      addi  $s3, $s3, 8      # advance the pointers by two words (assumed)
      addi  $s4, $s4, 8
      mul   $t0, $t0, $s2    # $t0 = A * X[i]
      mul   $t2, $t2, $s2    # $t2 = A * X[i+1]
      add   $t1, $t1, $t0    # Y[i]   (operands assumed)
      add   $t2, $t3, $t2    # Y[i+1] (operands assumed)
      nop                    # $t0 dependency
      nop
      sw    $t1, -8($s4)
      bne   $s4, $s1, Loop
      sw    $t2, -4($s4)     # in the branch delay slot
This code requires 15 instruction executions to complete two iterations of the original loop; the original loop required 2 × 13, or 26, instruction executions to do the same computation.
With additional unrolling, all nop instructions could be eliminated.
Loop merging
Consider the following computation using two SAXPY loops:
for (i = 0; i < N; i++)
{
Y[i] = A * X[i] + Y[i];
}
for (i = 0; i < N; i++)
{
Z[i] = A * X[i] + Z[i];
}
Clearly, it is possible to combine both those loops into one, and it would obviously be a bit more efficient (only one branch, rather than two).
for (i = 0; i < N; i++)
{
Y[i] = A * X[i] + Y[i];
Z[i] = A * X[i] + Z[i];
}
In fact, on a pipelined processor this may be much more efficient
than the original. This code segment can achieve the same savings
as a single loop unrolling on the MIPS processor.
# Text section
        .align 2
        .globl main
        .ent   main
main:   subiu $sp, $sp, 32
        sw    $ra, 20($sp)      # operands assumed, from the restores below
        sw    $fp, 16($sp)
        li    $a0, 10
        jal   fact
        la    $a0, $LC          # (la assumed from context)
        jal   printf
        lw    $ra, 20($sp)
        lw    $fp, 16($sp)
        jr    $ra

        .rdata
$LC:
        .ascii "The factorial of 10 is "
Now the factorial function itself, first setting up the function call
stack, then evaluating the function, and finally restoring saved register values and returning:
# factorial function
        .text                   # Text section
fact:   subiu $sp, $sp, 32
        sw    $ra, 20($sp)
        sw    $fp, 16($sp)
        sw    $a0, 0($fp)
        bgtz  $a0, $L2          # Branch if n > 0   (opcode assumed)
        li    $v0, 1            # Return 1
        j     $L1
$L2:                            # do recursion
        subiu $a0, $a0, 1       # subtract 1 from n
        jal   fact
        lw    $v1, 0($fp)       # recover n          (opcode assumed)
        mul   $v0, $v0, $v1     # n * fact(n-1)      (operands assumed)
$L1:                            # result is in $2
        lw    $ra, 20($sp)
        lw    $fp, 16($sp)
        addiu $sp, $sp, 32      # pop stack          (instruction assumed)
        jr    $ra
For this simple example, the data dependency in the recursion relates
to register $v1.
[Figure: state diagram for a two-bit branch predictor: the taken and weakly taken states predict taken, and two successive not-taken outcomes are needed to change the prediction (and vice versa).]
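The state machine above can be sketched in C as a saturating two-bit counter (a minimal illustration, not any particular processor's predictor):

/* 2-bit saturating branch predictor: states 0,1 predict not taken,
   states 2,3 predict taken; two consecutive mispredictions are
   needed to flip the prediction. */
typedef unsigned char Pred2;              /* holds 0..3 */

static int predict(Pred2 s)               /* 1 = predict taken */
{
    return s >= 2;
}

static Pred2 update(Pred2 s, int taken)   /* saturate at 0 and 3 */
{
    if (taken)  return s < 3 ? s + 1 : 3;
    else        return s > 0 ? s - 1 : 0;
}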
Memory
Memory is often the largest single component in a system, and consequently requires some special care in its design. Of course, it is
possible to use the simple register structures we have seen earlier,
but for large blocks of memory these are usually wasteful of chip
area.
For designs which require large amounts of memory, it is typical to use standard memory chips: these are optimized to provide large memory capacity at high speed and low cost.
There are two basic types of memory: static and dynamic. Static memory is typically more expensive and has a much lower capacity, but very high access speed. (This type of memory is often used for high performance cache memories.) The single-transistor dynamic memory is usually the cheapest RAM, with very high capacity, but relatively slow. It also must be refreshed periodically (by reading or writing) to preserve its data. (This type of memory is typically used for the main memory in computer systems.)
The following diagrams show the basic structures of some commonly
used memory cells in random access memories.
Static memory
[Figure: a static memory cell (transistors M1-M8): pass transistors M5 and M6 are gated by the X-enable line, and the shared column data lines are gated through M7 and M8 by the Y-enable line.]
This static memory cell is effectively an RS flip flop, where the input transistors are shared by all memory cells in the same column.
A particular cell is selected by the Y-enable signal, which gates the data onto the column lines, and by the X-enable signal, which opens the pass transistors (M5, M6) to the cell.
60
M9
M10
VDD
X-enable
M5
M1
s
M7
data
s s
s
M6
M3
s
M8
Y-enable
data
61
[Figure: a dynamic memory cell with separate data in and data out lines, and write (W) and refresh (R, P) control signals.]
For refresh, initially R=1, P=1, W=0 and the contents of memory
are stored on the capacitor. R is then set to 0, and W to 1, and the
value is stored back in memory, after being restored in the refresh
circuitry.
[Figure: the one-transistor dynamic memory cell (M5), with refresh and control circuitry on the data in/out line.]
This memory cell is not only dynamic, but a read destroys the contents of the memory (discharges the capacitor), and the value must
be rewritten. The memory state is determined by the charge on the
capacitor, and this charge is detected by a sense amplifier in the
control circuitry. The amount of charge required to store a value
reliably is important in this type of cell.
For the 1-transistor cell, there are several problems: the gate capacitance is too small to store enough charge, and the readout is destructive. (They are often made with a capacitor constructed over the top of the transistor, to save area.) Also, the output signal is usually quite small. (1M bit dynamic RAMs may store a bit using only 50,000 electrons!) This means that the sense amplifier must be carefully designed to be sensitive to small charge differences, as well as to respond quickly to changes.
[Figures: further memory cell circuit variants.]
The following slides show some of the ways in which single transistor memory cells can be reduced in area to provide high storage densities.
(Taken from Advanced Cell Structures for Dynamic RAMs, IEEE Circuits and Devices, V. 5, No. 1, pp. 27-36.)
The first figure shows a transistor beside a simple two plate capacitor, with both the capacitor and the transistor fabricated on the plane of the surface of the silicon substrate:
[Figures: single-transistor DRAM cell structures of progressively smaller area.]
[Figure: a dual-port static memory cell, with separate X- and Y-enable lines and data0/data1 lines for each port.]
[Figure: the memory hierarchy: the CPU with its cache, main memory, and disks attached through a disk controller.]
Cache memory
The cache is a small amount of high-speed memory, usually with a memory cycle time comparable to the time required by the CPU to fetch one instruction. The cache is usually filled from main memory when instructions or data are fetched into the CPU. Often the main memory will supply a wider data word to the cache than the CPU requires, to fill the cache more rapidly. The amount of information which is replaced at one time in the cache is called the line size for the cache. This is normally the width of the data bus between the cache memory and the main memory. A wide line size for the cache means that several instruction or data words are loaded into the cache at one time, providing a kind of prefetching for instructions or data.
Since the cache is small, the effectiveness of the cache relies on the following properties of most programs:
Spatial locality: most programs are highly sequential; the next instruction usually comes from the next memory location. Data is usually structured. Also, several operations are performed on the same data values, or variables.
Temporal locality: short loops are a common program structure, especially for the innermost sets of nested loops. This means that the same small set of instructions is used over and over, as are many of the data elements.
When a cache is used, there must be some way in which the memory
controller determines whether the value currently being addressed in
memory is available from the cache. There are several ways that this
can be accomplished. One possibility is to store both the address and
the value from main memory in the cache, with the address stored in
a type of memory called associative memory or, more descriptively,
content addressable memory.
An associative memory, or content addressable memory, has the
property that when a value is presented to the memory, the address
of the value is returned if the value is stored in the memory, otherwise
an indication that the value is not in the associative memory is returned. All of the comparisons are done simultaneously, so the search
is performed very quickly. This type of memory is very expensive,
because each memory location must have both a comparator and a
storage element. A cache memory can be implemented with a block
of associative memory, together with a block of ordinary memory.
The associative memory holds the address of the data stored in the
cache, and the ordinary memory contains the data at that address.
[Figure: a cache built from an associative memory and an ordinary memory: the input address is compared against every stored address in parallel, and a match selects the corresponding data word in the ordinary memory.]
If the address is not found in the associative memory, then the value
is obtained from main memory.
Associative memory is very expensive, because a comparator is required for every word in the memory, to perform all the comparisons
in parallel.
A cheaper way to implement a cache memory, without using expensive associative memory, is to use direct mapping. Here, part of
the memory address (the low order digits of the address) is used to
address a word in the cache. This part of the address is called the
index. The remaining high-order bits in the address, called the tag,
are stored in the cache memory along with the data.
For example, if a processor has an 18 bit address for memory, and
a cache of 1 K words of 2 bytes (16 bits) length, and the processor
can address single bytes or 2 byte words, we might have the memory
address field and cache organized as follows:
[Figure: the cache organization for this example: memory address bits 17-11 form the tag, bits 10-1 the index, and bit 0 selects the byte in the word; each of the 1024 cache entries holds a tag, two data bytes (byte 0 and byte 1), parity bits, and a valid bit.]
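A sketch in C of the lookup for this organization (the structure and names are invented for illustration):

#include <stdint.h>

#define CACHE_LINES 1024                 /* 10-bit index */

typedef struct {
    unsigned tag   : 7;                  /* address bits 17..11 */
    unsigned valid : 1;
    uint8_t  byte0, byte1;               /* the two data bytes in the line */
} Line;

static Line cache[CACHE_LINES];

/* Returns 1 on a cache hit for the given 18-bit byte address. */
static int lookup(uint32_t addr, uint8_t *out)
{
    uint32_t index = (addr >> 1) & (CACHE_LINES - 1); /* bits 10..1 */
    uint32_t tag   = (addr >> 11) & 0x7F;             /* bits 17..11 */
    Line *l = &cache[index];
    if (l->valid && l->tag == tag) {
        *out = (addr & 1) ? l->byte1 : l->byte0;      /* bit 0: byte in word */
        return 1;
    }
    return 0;                            /* miss: fetch from main memory */
}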
This was, in fact, the way the cache was organized in the PDP-11/60.
In the 11/60, however, there are 4 other bits used to ensure that the
data in the cache is valid. 3 of these are parity bits; one for each byte
and one for the tag. The parity bits are used to check that a single
bit error has not occurred to the data while in the cache. A fourth
bit, called the valid bit is used to indicate whether or not a given
location in cache is valid.
In the PDP-11/60 and in many other processors, the cache is not
updated if memory is altered by a device other than the CPU (for
example when a disk stores new data in memory). When such a
memory operation occurs to a location which has its value stored
in cache, the valid bit is reset to show that the data is stale and
does not correspond to the data in main memory. As well, the valid
bit is reset when power is first applied to the processor or when the
processor recovers from a power failure, because the data found in
the cache at that time will be invalid.
In the PDP-11/60, the data path from memory to cache was the
same size (16 bits) as from cache to the CPU. (In the PDP-11/70,
a faster machine, the data path from the CPU to cache was 16 bits,
while from memory to cache was 32 bits which means that the cache
had effectively prefetched the next instruction, approximately half
of the time). The number of consecutive words taken from main
memory into the cache on each memory fetch is called the line size
of the cache. A large line size allows the prefetching of a number
of instructions or data words. All items in a line of the cache are
replaced in the cache simultaneously, however, resulting in a larger
block of data being replaced for each cache miss.
[Figure: a 1024-line cache with a line size of two words: memory address bits 17-12 form the tag, bits 11-2 the index, bit 1 selects the word in the line, and bit 0 the byte in the word.]
For a similar 2K word (or 8K byte) cache, the MIPS processor would
typically have a cache configuration as follows:
[Figure: the corresponding MIPS organization: a 32-bit memory address with bits 31-13 as the tag, bits 12-3 as the index into 1024 two-word lines, bit 2 selecting the word in the line, and bits 1-0 the byte in the word.]
Generally, the MIPS cache would be larger (64 Kbytes would be typical), with line sizes of 1, 2 or 4 words.
[Figure: a 2-way set associative cache: each of the 1024 sets holds two tag and line fields (TAG 0/LINE 0 and TAG 1/LINE 1).]
In a 2-way set associative cache, if one data line is empty for a read
operation corresponding to a particular index, then it is filled. If both
data lines are filled, then one must be overwritten by the new data.
Similarly, in an n-way set associative cache, if all n data and tag fields
in a set are filled, then one value in the set must be overwritten, or
replaced, in the cache by the new tag and data values. Note that an
entire line must be replaced each time.
Least recently used (LRU): here the value which was actually used least recently is replaced. In general, it is more likely that the most recently used value will be the one required in the near future. For a 2-way set associative cache, this strategy can be implemented by adding a single USED bit to each cache location: when a value is accessed, the USED bit of the other word in the set is set, and the bit of the accessed word is reset. The value to be replaced is then the value with its USED bit set. For an n-way set associative cache, this strategy can be implemented by storing a modulo n counter with each data word. (It is an interesting exercise to determine exactly what must be done in this case. The required circuitry may become somewhat complex, for large n.)
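For the 2-way case, the USED-bit scheme amounts to very little logic; a sketch in C (names invented):

/* 2-way set associative LRU with one USED bit per way:
   on an access to way w, set the USED bit of the other way and
   clear the bit of way w; the victim is the way whose bit is set. */
typedef struct { int used[2]; } SetLru;

static void touch(SetLru *s, int way)
{
    s->used[way]     = 0;       /* just used: not the replacement candidate */
    s->used[1 - way] = 1;       /* the other way becomes the candidate */
}

static int victim(const SetLru *s)
{
    return s->used[0] ? 0 : 1;  /* replace the way with its USED bit set */
}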
[Figures: cache miss ratio (0.001 to 0.1, log scale) plotted against line size (16 to 256), degree of associativity (1-way to fully associative), and cache size (1K to 1024K).]
Example:
Assume a cache miss rate of 5% (a hit rate of 95%), with cache memory of 1 ns cycle time, and main memory of 35 ns cycle time. We can calculate the average cycle time as

(1 − 0.05) × 1 ns + 0.05 × 35 ns = 2.7 ns
The following table shows the effective memory cycle time as a function of cache hit rate for the system in the above example:

Cache hit %   Effective cycle time (ns)
80            7.8
85            6.1
90            4.4
95            2.7
98            1.68
99            1.34
100           1.0
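The same calculation, as a trivial C function:

/* Effective memory cycle time, given the hit rate and the cycle
   times of the cache and of main memory:
       t_eff = h * t_cache + (1 - h) * t_main
   With t_cache = 1 ns and t_main = 35 ns, h = 0.95 gives 2.7 ns. */
static double effective_cycle_ns(double hit_rate,
                                 double t_cache_ns, double t_main_ns)
{
    return hit_rate * t_cache_ns + (1.0 - hit_rate) * t_main_ns;
}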
Both the VAX 3500 and the MIPS R2000 processors have interesting cache structures, and were marketed at the same time. (Interestingly, neither of the parent companies which produced these processors is now an independent company. Digital Equipment Corporation was acquired by Compaq, which in turn was acquired by Hewlett Packard. MIPS was acquired by Silicon Graphics Corporation.)
The VAX 3500 has two levels of cache memory: a 1 Kbyte 2-way set associative cache is built into the processor chip itself, and there is an external 64 Kbyte direct mapped cache. The overall cache hit rate is typically 95 to 99%. If there is an on-chip (first level) cache hit, the external memory bus is not used by the processor. The first level cache responds to a read in one machine cycle (90 ns), while the second level cache responds within two cycles. Both caches can be configured as caches for instructions only, for data only, or for both instructions and data. In a single processor system, a mixed cache is typical; in systems with several processors and shared memory, one way of ensuring data consistency is to cache only instructions (which are not modified); then all data must come from main memory, and consequently whenever a processor reads a data word, it gets the current value.
See C. J. DeVane, Design of the MicroVAX 3500/3600 Second Level Cache, Digital Technical Journal, No. 7, pp. 87-94, for a discussion of the performance of this cache.
The MIPS R2000 has no on-chip cache, but it has provision for the addition of up to 64 Kbytes of instruction cache and 64 Kbytes of data cache. Both caches are direct mapped. Separation of the instruction and data caches is becoming more common in processor systems, especially for direct mapped caches. In general, instructions tend to be clustered in memory, and data also tend to be clustered, so having separate caches reduces cache conflicts. This is particularly important for direct mapped caches. Also, instruction caches do not need any provision for writing information back into memory.
Both processors employ a write-through policy for memory writes, and both provide some buffering between the cache and memory, so processing can continue during memory writes. The VAX 3500 provides a quadword buffer, while the buffer for the MIPS R2000 depends on the particular system in which it is used. A small write buffer is normally adequate, however, since writes are much less frequent than reads.
The following two results (see High Performance Computer Architecture by H.S. Stone, Addison Wesley, Chapter 2, Section 2.2.2, pp. 57-70), derived by Puzak in his Ph.D. thesis (T.R. Puzak, Cache Memory Design, University of Massachusetts, 1985), can be used to reduce the size of the traces and still result in realistic simulations.
The first trace reduction, or trace stripping, technique assumes that a series of caches of related sizes starting with a cache of size N, all with the same line size, are to be simulated with some cache trace. The cache trace is reduced by retaining only those memory references which result in a cache miss for a direct mapped cache.
Note that, for a miss rate of 10%, 90% of the memory trace would be discarded. Lower miss rates result in higher reductions.
The reduced trace will produce the same number of cache misses as the original trace for:
A K-way set associative cache with N sets and line size L
A one-way set associative cache with 2N sets and line size L (provided that N is a power of 2)
In other words, for caches with size some power of 2, it is possible to investigate caches with sizes a multiple of the initial cache size, and with arbitrary set associativity, using the same reduced trace.
Interleaved memory
In large computer systems, it is common to have several sets of data and address lines connected to independent banks of memory, arranged so that adjacent memory words reside in different memory banks. Such memory systems are called interleaved memories, and allow simultaneous, or time overlapped, access to adjacent memory locations. Memory may be n-way interleaved, where n is usually a power of two; 2-, 4- and 8-way interleaving is common in large mainframes. In such systems, the cache size typically would be sufficient to contain a data word from each bank. The following diagram shows an example of a 4-way interleaved memory.
[Figure: a 4-way interleaved memory in two arrangements: (a) all banks on a shared memory bus, and (b) banks with separate connections to the CPU.]
Here we can have two cases: case (a), where the execution time is less than the full time for an operand fetch, and case (b), where the execution time is greater than the time for an operand fetch. The following figures (a) and (b) show cases (a) and (b) respectively.
[Figures: timing of overlapped instruction and operand fetches in an interleaved memory (access time ta, cycle time ts, data transfer time td), for (a) execution time tea less than the operand fetch time and (b) execution time teb greater than the operand fetch time.]
Note that this example assumes that there is no conflict: the instruction and its operand are in separate memory banks. For this example, the instruction execution time is

ti = 2ta + td + te

If ta ≫ ts and te is small, then ti(interleaved) ≈ (1/2) ti(non-interleaved).
IF = Σ_{K=1}^{N} (1 − λ)^(K−1) = (1/λ) [1 − (1 − λ)^N]

where N is the number of interleaved memory banks. IF is, effectively, the number of memory banks being used.
Example:
If N = 4 and λ = 0.1, then

IF = (1/0.1) (1 − (1 − 0.1)^4) = 10 (1 − 0.9^4) ≈ 3.4

For operands, a simple (but rather pessimistic) thing is to assume that the data is randomly distributed in the memory banks. In this case, the probability Q(K) of a string of length K is

Q(K) = (K/N) ((N−1)/N) ((N−2)/N) ... = (N−1)! K / ((N−K)! N^K)

and the average number of operand fetches is

OF = Σ_{K=1}^{N} K Q(K)
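The interleave factor is easy to evaluate numerically; a small C sketch (with lambda spelled out):

#include <math.h>

/* Interleave factor: IF = (1/lambda) * (1 - (1 - lambda)^N). */
static double interleave_factor(int n_banks, double lambda)
{
    return (1.0 / lambda) * (1.0 - pow(1.0 - lambda, n_banks));
}
/* interleave_factor(4, 0.1) == 10 * (1 - 0.9^4), about 3.4 */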
[Figure: the shell pipeline cat file1 file2 | sort: the output of cat is connected to the input of sort.]
Here there are two processes, cat and sort, with their data specified.
When this command is executed, the processes cat and sort are
particular instances of the programs cat and sort.
Note that these two processes can exist in at least 3 states: active,
or running; ready to run, but temporarily stopped because the other
process is running; or blocked waiting for data from another process.
[Figure: the three process states (active, ready, blocked) and the transitions between them.]
Following is a simplified process state diagram for the UNIX operating system:
[Figure: simplified UNIX process state diagram: birth (start process), ready, kernel running, user running, and asleep, with transitions for system calls or interrupts, interrupt return, schedule process, sleep, wakeup, and death (kill).]
One of the most fundamental resources to be allocated among processes (in a single CPU system) is the main memory.
A number of allocation strategies are possible:
(1) single storage allocation: here all of main memory (except for space for the operating system nucleus, or kernel) is given to the current process.
Two problems can arise here:
1. the process may require less memory than is available (wasteful
of memory)
2. the process may require more memory than is available.
The second is a serious problem, which can be addressed in several
ways. The simplest of these is by static overlay, where a block of
data or code not currently required is overwritten in memory by the
required code or data.
This was originally done by the programmer, who embedded commands to load the appropriate blocks of code or data directly in the
source code.
Later, loaders were available which analyzed the code and data blocks
and loaded the appropriate blocks when required.
This type of memory management is still used in primitive operating
systems (e.g., DOS).
[Figure: a static overlay tree of numbered segments (from 8k to 80k), in which only the segments on one path from the root need be resident at once.]
Clearly, segments at the same level in the tree need not be memory
resident at the same time. e.g., in the above example, it would be
appropriate to have segments (1,3,9) and (5,7) in memory simultaneously, but not, say, (2,3).
[Figure: fixed-partition allocation: a 40k kernel and three jobs in fixed partitions, with wasted space where a job is smaller than its partition.]
This system did not offer a very efficient use of memory; the systems
manager had to determine an appropriate memory partition, which
was then fixed. This limited the number of processes, and the mix
of processes which could be run at any given time.
Also, in this type of system, dynamic data structures pose difficulties.
[Figure: variable-partition allocation at three successive times: jobs are loaded where they fit, and terminating jobs leave free holes of various sizes between the remaining jobs.]
Here, dynamic data structures are still a problem: jobs are placed in areas where they fit at the time of loading.
A new problem here is memory fragmentation: it is usually much easier to find a block of memory for a small job than for a large job. Eventually, memory may contain many small jobs, separated by holes too small for any of the queued processes.
This effect may seriously reduce the chances of running a large job.
[Figure: memory compaction: the remaining jobs are moved together so that the separate free holes are consolidated into one large free block.]
In this system, the whole program must be moved, which may have a penalty in execution time. This is a tradeoff: how frequently memory should be compacted, against the performance lost to memory fragmentation.
Again, dynamic memory allocation is still difficult, but less so than for the other systems.
The process of translating, or mapping, a virtual address into a physical address is called virtual address translation. The following
diagram shows the relationship between a named variable and its
physical location in the system.
[Figure: the relationship between a logical name in the name space, its virtual address, and its physical address.]
[Figure: paged address translation: the virtual page number indexes the page map, which supplies the base address of the page in physical memory; the page offset is appended unchanged.]
Note that whole page blocks in virtual memory are mapped to whole
page blocks in physical memory.
This means that the page offset is part of both the virtual and physical address.
Requiring two memory fetches for each instruction is a large performance penalty, so most virtual addressing systems have a small
associative memory (called a translation lookaside buffer, or TLB)
which contains the last few virtual addresses and their corresponding physical addresses. Then for most cases the virtual to physical
mapping does not require an additional memory access. The following diagram shows a typical virtual-to-physical address mapping in
a system containing a TLB:
[Figure: address translation with a TLB: the virtual page number is presented to the TLB; on a hit the physical page number is produced directly, and on a miss the page map in memory supplies it.]
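The translation path can be sketched in C. This is a simplified software model (the associative search is written as a loop, though the hardware performs it in parallel, and all structures are invented for illustration):

#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_BITS   12                     /* 4 KB pages */

typedef struct { uint32_t vpn, ppn; int valid; } TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];
static uint32_t page_map[1 << 20];         /* page table in main memory */

static uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)  /* done in parallel in hardware */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << PAGE_BITS) | offset;

    /* TLB miss: one extra memory access to the page map (the TLB refill,
       and a page fault for a non-resident page, are not shown). */
    return (page_map[vpn] << PAGE_BITS) | offset;
}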
[Figure: per-process virtual address spaces mapped through virtual to physical address translation onto physical memory.]
Note that not all the virtual address blocks are in the physical memory at the same time. Furthermore, adjacent blocks in virtual memory are not necessarily adjacent in physical memory.
If a block is moved out of physical memory and later replaced, it may
not be at the same physical address.
The translation process must be fast, most of the time.
The following is an example of a paged memory management configuration using a fully associative page translation table:
Consider a computer system which has 16 M bytes (2^24 bytes) of main memory, and a virtual memory space of 2^32 bytes. The following diagram sketches the page translation table required to manage all of main memory if the page size is 4K (2^12) bytes. Note that the associative memory is 20 bits wide (32 bits − 12 bits, the virtual address size minus the page size). Also, to manage 16 M bytes of memory with a page size of 4 K bytes, a total of 16M/4K = 2^12 = 4096 associative memory locations are required.
[Figure: the fully associative page translation table: virtual address bits 31-12 are matched against all 4096 associative memory entries, each paired with a physical page address; bits 11-0 select the byte in the page.]
Some other attributes are usually included in a page translation table, as well, by adding extra fields to the table. For example, pages
or segments may be characterized as read only, read-write, etc. As
well, it is common to include information about access privileges, to
help ensure that one program does not inadvertently corrupt data for
another program. It is also usual to have a bit (the dirty bit) which
indicates whether or not a page has been written to, so that the page
will be written back onto the disk if a memory write has occurred
into that page. (This is done only when the page is swapped,
because disk access times are too long to permit a write-through
policy like cache memory.) Also, since associative memory is very expensive, it is not usual to map all of main memory using associative
memory; it is more usual to have a small amount of associative memory which contains the physical addresses of recently accessed pages,
and maintain a virtual address translation table in main memory
for the remaining pages in physical memory. A virtual to physical
address translation can normally be done within one memory cycle
if the virtual address is contained in the associative memory; if the
address must be recovered from the virtual address translation table in main memory, at least one more memory cycle must be used
to retrieve the physical address from main memory.
There is a kind of trade-off between the page size for a system and the size of the page translation table (PTT). If a processor has a small page size, then the PTT must be quite large to map all of the virtual memory space. For example, if a processor has a 32 bit virtual memory address, and a page size of 512 bytes (2^9 bytes), then there are 2^23 possible page table entries. If the page size is increased to 4 Kbytes (2^12 bytes), then the PTT requires only 2^20, or 1 M, page table entries. These large page tables will normally not be very full, since the number of entries is limited to the amount of physical memory available.
One way these large, sparse PTTs are managed is by mapping the
PTT itself into virtual memory. (Of course, the pages which map
the virtual PTT must not be mapped out of the physical memory!)
There are also other pages that should not be mapped out of physical
memory. For example, pages mapping to I/O buffers. Even the I/O
devices themselves are normally mapped to some part of the physical
address space.
Note that both paged and segmented memory management provide the users of a computer system with all the advantages of a
large virtual address space. The principal advantage of the paged
memory management system over the segmented memory management system is that the memory controller required to implement a
paged memory management system is considerably simpler. Also,
the paged memory management does not suffer from fragmentation
in the same way as segmented memory management. Another kind
of fragmentation does occur, however. A whole page is swapped in or
out of memory, even if it is not full of data or instructions. Here the
fragmentation is within a page, and it does not persist in the main
memory when new pages are swapped in.
One problem found in virtual memory systems, particularly paged
memory systems, is that when there are a large number of processes
executing simultaneously as in a multiuser system, the main memory may contain only a few pages for each process, and all processes
may have only enough code and data in main memory to execute for
a very short time before a page fault occurs. This situation, often
called thrashing, severely degrades the throughput of the processor because it actually must spend time waiting for information to
be read from or written to the disk.
MIPS R2000
The MIPS R2000 has a 4 Kbyte page size, and 64 entries in its fully associative TLB, which can perform two translations in each machine cycle: one for the instruction to be fetched and one for the data to be fetched or stored (for the LOAD and STORE instructions).
Unlike the VAX 3500 (and most other processors, including other RISC processors), the MIPS R2000 does not handle TLB misses using hardware. Rather, an exception (the TLB miss exception) is generated, and the address translation is handled in software. In fact, even the replacement of the entry in the TLB is handled in software. Usually, the replacement algorithm chosen is random replacement; the processor generates a random number between 8 and 63 for this purpose. (The lowest 8 TLB locations are normally reserved for the kernel; e.g., to refer to such things as the current PTT.)
This is another example of the MIPS designers making a tradeoff: providing a larger TLB, thus reducing the frequency of TLB misses, at the expense of handling those misses in software, much as if they were page faults.
A page fault, however, would cause the current process to be stopped and another to be started, so the cost in time would be much higher than a mere TLB miss.
[Figure: page faults versus number of pages allocated (6 to 14), for the FIFO, CLOCK, LRU, and OPT replacement policies.]
[Figure: page faults (×1000) versus number of pages (8 to 1024, for a fixed 8K memory), for the FIFO, CLOCK, LRU, and OPT policies.]
Note that, when the page size is sufficiently small, the performance degrades. In this (small) example, the small number of pages loaded in memory degrades the performance severely for the largest page size (2K bytes, corresponding to only 4 pages in memory). Performance improves with an increased number of pages (of smaller size) in memory, until the page size becomes small enough that a page doesn't hold an entire logical block of code.
Example:
Given a program with 7 virtual pages {a,b,...,g} and a reference sequence of 14 accesses, the set of pages resident after each reference is:

 1  a          8  acgf
 2  ab         9  acgf
 3  ab        10  acgf
 4  abc       11  acgf
 5  abcg      12  agfd
 6  acg       13  afdb
 7  acgf      14  fdbg
[Figure: page faults versus memory allocated, comparing LRU and working set (WS) replacement.]
Page fault frequency replacement
This is another method for varying the amount of physical memory available to a process. It is based on the simple observation that, when the frequency of page faults for a process increases above some threshold, then more memory should be allocated to the process.
The page fault frequency (PFF) can be approximated by 1/(time between page faults), although a better estimate can be obtained by averaging over a few page faults.
A PFF implementation must both increase the number of pages if the PFF is higher than some threshold, and must also lower the number of pages in some way. A reasonable policy might be the following:
Increase the number of pages allocated to the process by 1 whenever the PFF is greater than some threshold Th .
Decrease the number of pages allocated to the process by 1 whenever the PFF is less than some threshold Tl .
If Tl < PFF < Th, then replace a page in memory by some other
reasonable policy; e.g., LRU.
The thresholds Th and Tl should be system parameters, depending
on the amount of physical memory available.
An alternative policy for decreasing the number of pages allocated
to a process might be to decrease the number of pages allocated to
a process when the PFF does not exceed T for some period of time.
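A sketch of the first policy in C (the threshold values are placeholders; as noted above, they should be system parameters):

/* Page-fault-frequency allocation: adjust a process's page
   allocation according to the observed fault rate. */
#define T_H 0.02    /* faults per unit time: upper threshold (placeholder) */
#define T_L 0.001   /* lower threshold (placeholder) */

static int adjust_allocation(int pages, double pff)
{
    if (pff > T_H)      return pages + 1;  /* faulting too often: grow */
    else if (pff < T_L) return pages - 1;  /* over-allocated: shrink */
    /* otherwise keep the allocation and replace within it (e.g., LRU) */
    return pages;
}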
Note that in all the preceding we have implicitly assumed that pages will be loaded on demand: this is called demand paging. It is also possible to attempt to predict what pages will be required in the future, and preload the pages in anticipation of their use. The penalty for a bad guess is high, however, since part of memory will be filled with useless information. Some systems do use preloading algorithms, but most present systems rely on demand paging.
[Figure: the x86-64 virtual address: bits 63-48 unused; four 9-bit indices (page map level 4, bits 47-39; level 3, bits 38-30; page directory, bits 29-21; page table, bits 20-12); and a 12-bit offset in bits 11-0.]
The 12 bit offset specifies the byte in a 4KB page. The 9 bit (512 entry) page table points to the specific page, while the three higher level (9 bit, 512 entry) tables are used to point eventually to the page table.
The page table itself maps 512 × 4KB pages, or 2MB of memory. Adding one more level increases this by another factor of 512, for 1GB of memory, and so on.
Clearly, most programs do not use anywhere near all the available virtual memory, so the page tables and higher level page maps are very sparse.
Both Windows 7/8 and Linux use a page size of 4KB, although Linux
also supports a 2MB page size for some applications.
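Extracting the four table indices and the offset from a virtual address is a matter of shifts and masks; a C sketch:

#include <stdint.h>

/* Split an x86-64 virtual address into its four 9-bit table indices
   and the 12-bit page offset. */
static void split_va(uint64_t va, unsigned idx[4], unsigned *offset)
{
    *offset = va & 0xFFF;                /* bits 11..0 */
    idx[0]  = (va >> 12) & 0x1FF;        /* page table index, bits 20..12 */
    idx[1]  = (va >> 21) & 0x1FF;        /* page directory, bits 29..21 */
    idx[2]  = (va >> 30) & 0x1FF;        /* level 3 map, bits 38..30 */
    idx[3]  = (va >> 39) & 0x1FF;        /* page map level 4, bits 47..39 */
}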
[Figure: the 32-bit ARM virtual address: a 10-bit outer page field (bits 31-22), a 10-bit inner page field (bits 21-12), and a 12-bit offset into a 4KB page.]
The 10 bit (1K entry) outer page table points to an inner page
table of the same size. The inner page table contains the mapping for the virtual page in physical memory.
Again, Linux on the ARM architecture uses 4KB pages, as do the
other operating systems commonly running on the ARM.
Different ARM implementations have different size TLBs, implemented in the hardware. Of course, the page table mapping is used
only on a TLB miss.
[Figure: the structure of the UNIX system: user programs enter the kernel through libraries, traps, and interrupts; the kernel comprises the file subsystem (with buffer cache and character/block device drivers) and the process control subsystem (inter-process communication, scheduler, memory management), sitting on the hardware control layer above the computer itself.]
[Figure: the full UNIX process state diagram: created (by fork), ready in memory or ready swapped, kernel running, user running, preempted, asleep in memory or asleep swapped, and zombie (after exit), with transitions for scheduling, preemption, sleep, wakeup, swap in/out, system calls, and interrupts.]
In the operating system, each process is represented by its own process control block (sometimes called a task control block, or job
control block). This process control block is a data structure (or set
of structures) which contains information about the process. This
information includes everything required to continue the process if it
is blocked for any reason (or if it is interrupted). Typical information
would include:
the process state: ready, running, blocked, etc.
the values of the program counter, stack pointer, and other internal registers
process scheduling information, including the priority of the process, its elapsed time, etc.
memory management information
I/O information: status of I/O devices, queues, etc.
accounting information; e.g., CPU and real time, amount of disk used, amount of I/O generated, etc.
In many systems, the space for these process control blocks is allocated (in system space memory) when the system is generated, which places a firm limit on the number of processes which can be allocated at one time. (The simplicity of this allocation makes it attractive, even though it may waste part of system memory by having blocks allocated which are rarely used.)
Following is a diagram of the process management data structures in
a typical UNIX system:
[Figure: process management data structures in a typical UNIX system: the process table and per-process region tables point to the u area and to the text and stack regions in main memory.]
Process scheduling:
Although in a modern multi-tasking system, each process can make
use of the full resources of the virtual machine while actually sharing these resources with other processes, the perceived use of these
resources may depend considerably on the way in which the various
processes are given access to the processor. We will now look at
some of the things which may be important when processes are to
be scheduled.
We can think of the scheduler as the algorithm which determines which virtual machine is currently mapped onto the physical machine.
Actually, two types of scheduling are required: a long term scheduler, which determines which processes are to be loaded into memory, and a short term scheduler, which determines which of the processes in memory will actually be running at any given time. The short-term scheduler is also called the dispatcher.
Most scheduling algorithms deal with one or more queues of processes; each process in each queue is assigned a priority in some way, and the process with the highest priority is the process chosen to run next.
Replacement strategies
There are a small number of commonly used replacement strategies
for a block:
random replacement
first-in first-out (FIFO)
first-in not used first-out (clock)
Least recently used (LRU)
Speculation (prepaging, preloading)
Writing
There are two basic strategies for writing data from one level of the
hierarchy to the other:
Write through: both levels are consistent, or coherent.
Write back: only the highest level has the correct value, and it is written back to the next level on replacement. This implies that there is a way of indicating that the block has been written into (e.g., with a used bit).
We will use the ATMEL AVR series of processors as example input/output processors, or controllers for I/O devices.
These 8-bit processors, and others like them (PIC microcontrollers,
8051s, etc.) are perhaps the most common processors in use today.
Frequently, they are not used as individually packaged processors,
but as part of embedded systems, particularly as controllers for other
components in a larger integrated system (e.g., mobile phones).
There are also 16-, 32- and 64-bit processors in the embedded systems
market; the MIPS processor family is commonly used in the 32-bit
market, as is the ARM processor. (The ARM processor is universal
in the mobile telephone market.)
We will look at the internal architecture of the ATMEL AVR series
of 8-bit microprocessors.
They are available as single chip devices in package sizes from 8 pins
(external connections) to 100 pins, and with program memory from
1 to 256 Kbytes.
AVR architecture
Internally, the AVR microcontrollers have:
32 8-bit registers, r0 to r31
16 bit instruction word
a minimum 16-bit program counter (PC)
separate instruction and data memory
64 registers dedicated to I/O and control
externally interruptible, interrupt source is programmable
most instructions execute in one cycle
The top 6 registers can be paired as address pointers to data memory.
The X-register is the pair (r26, r27),
the Y-register is the pair (r28, r29), and
the Z-register is the pair (r30, r31).
The Z-register can also point to program memory.
Generally, only the top 16 registers (r16 to r31) can be targets for
the immediate instructions.
166
The program memory is flash programmable, and is fixed until overwritten with a programmer. It is guaranteed to survive at least 10,000
rewrites.
In many processors, self programming is possible. A bootloader can
be stored in protected memory, and a new program downloaded by
a simple serial interface.
In some older small devices, there is no data memory: programs use registers only. In those processors, there is a small stack in hardware (3 entries). In the processors with data memory, the stack is located in the data memory.
The size of on-chip data memory (SRAM, static memory) varies from 0 to 16 Kbytes.
Most processors also have EEPROM memory, from 0 to 4 Kbytes.
The C compiler can only be used for devices with SRAM data memory.
Only a few of the older tiny ATMEL devices do not have SRAM
memory.
[Figure: the AVR core: the program counter and program flash; the instruction register, instruction decoder, and control lines; the general purpose registers (including the X pointer pair), the ALU, and the status register; and the stack pointer with the SRAM or hardware stack.]
Note that the datapath is 8 bits, and the ALU accepts two independent operands from the register file.
Note also the status register, (SREG) which holds information about
the state of the processor; e.g., if the result of a comparison was 0.
168
On-chip peripherals
[Figure: the same AVR core surrounded by its on-chip peripherals:
internal oscillator, timing and control, watchdog timer, MCU control
register, interrupt unit, data EEPROM, ADC, analog comparator,
timers, and, for each port X, a data direction register, a data register,
and the port drivers for pins Px0 to Px7]
169
[Figure: AVR memory maps: the program flash holds an application
program section and a boot loader section (up to 0x1FFF); in the data
memory, the 32 general purpose registers occupy 0x0000 to 0x001F,
the 64 I/O registers follow up to 0x005F, and the internal SRAM runs
up to 0x04FF]
Note that the general purpose registers and I/O registers are mapped
into the data memory.
Although most (all but 2) AVR processors have EEPROM memory,
this memory is accessed through I/O registers.
170
Many immediate instructions can only use registers r16 to r31, so be
careful with those instructions.
There is no add immediate instruction.
There is an add word immediate instruction (adiw) which operates on
the pointer register pairs as 16-bit entities, in 2 cycles. The maximum
constant which can be added is 63.
There are also logical and logical immediate instructions.
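For example (a sketch; the usual idiom for the missing add immediate
is subi with a negated constant):

subi r16, -10      ; subtract -10: adds 10 (immediate ops use r16 to r31 only)
adiw r26, 63       ; add word immediate to the X pair (r27:r26); max constant 63
andi r17, 0x0f     ; logical AND immediate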
171
These instructions can also post-increment or pre-decrement the index register. E.g.,
ld r19, X+ ; load register 19 with the value pointed
; to by the index register X (r26, r27)
; and add 1 to the register pair
ld r19, -Y ; subtract 1 from the index reg. Y (r28, r29)
; then load register 19 with the value pointed
; to by the decremented value of Y
There is also a load immediate (ldi) which can only operate on
registers r16 to r31.
ldi r17, 14
There are also push and pop instructions which push a byte onto,
or pop a byte off, the stack. (The stack pointer is in the I/O space,
registers 0x3D and 0x3E.)
172
The relative call (rcall) instruction is similar, but places the return
address (PC + 1) on the stack.
The return instruction (ret) returns from a function call by replacing
the PC with the value on the stack.
There are also instructions which skip over the next instruction on
some condition. For example, the instruction skip if register bit
set (sbrs) skips the next instruction (increments the PC by 2 or 3) if a
particular bit in the designated register is set.
sbrs r1, 4    ; skip the next instruction if bit 4 of r1 is set
173
#include <m168def.inc>
.org 0
; define interrupt vectors
vects:
    rjmp reset
reset:
    ldi R16, 0b00100000
    out DDRB, R16       ; PB5 as output
    ser R16             ; R16 = 0xff
    out PORTB, R16      ; PB5 high, pull-ups enabled on the inputs
177
LOOP:
    sbic PINB, 4        ; skip the next instruction if PB4 reads 0 (switch pressed)
    rjmp LOOP           ; not pressed: repeat the test
    cbi PORTB, 5        ; drive PB5 low
SPIN1:
    subi R16, 1         ; crude delay: count R16 down to 0
    brne SPIN1
    sbi PORTB, 5        ; drive PB5 high
SPIN2:
    subi R16, 1         ; R16 wraps around, giving 256 more iterations
    brne SPIN2
    rjmp LOOP
178
#include <avr/io.h>
#include <util/delay_basic.h>

int main(void)
{
    DDRB = 0B00100000;                    /* PB5 as output */
    PORTB = 0B11111111;                   /* PB5 high, pull-ups on the inputs */
    while (1) {
        while (!(PINB & 0B00010000)) {    /* while the switch (PB4) reads low */
            PORTB = 0B00100000;           /* PB5 high */
            _delay_loop_1(128);
            PORTB = 0;                    /* PB5 low */
            _delay_loop_1(128);
        }
    }
    return(1);
}
Two words about mechanical switches: they bounce! That is, they
make and break contact several times in the few milliseconds before
full contact is made or broken. This means that a single switch
operation may be seen as several switch actions.
The way this is normally handled is to read the value at the switch (in
a loop) several times over a short period, and report a stable value.
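A minimal debouncing sketch in C (assuming, as in the earlier example,
an active-low switch on PINB bit 4, and avr-libc):

#include <avr/io.h>
#include <util/delay_basic.h>

static uint8_t switch_pressed(void)
{
    uint8_t count = 0;
    for (uint8_t i = 0; i < 8; i++) {   /* several readings over a short period */
        if (!(PINB & 0B00010000))       /* pressed reads as 0 (active low) */
            count++;
        _delay_loop_1(255);             /* short settling delay between reads */
    }
    return count == 8;                  /* report only a stable "pressed" */
}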
179
Address   Event
0x0000    RESET     (power on or reset)
0x0002    INT0
0x0004    INT1
0x0006    PCINT0
0x0008    PCINT1
0x000A    PCINT2
0x000C    WDT       (watchdog timer)
181
To enable the interrupt, the appropriate bit of the external interrupt
mask register (EIMSK) is set:
sbi EIMSK, 7
There is also a register associated with the particular pins for the
PCINT interrupts. They are the Pin Change Mask Registers (PCMSK1
and PCMSK0).
We want to enable the input connected to the switch, at PINB[4],
which is pcint12, and therefore set bit 4 of PCMSK1, leaving the
other bits unchanged.
Normally, this would be possible with the code
sbi PCMSK1, 4
Unfortunately, this register is one of the extended I/O registers, and
must be written to as a memory location.
The following code sets up its address in register pair Y, reads the
current value in PCMSK1, sets bit 4 to 1, and rewrites the value in
memory.
ldi r28, PCMSK1      ; low byte of the address of PCMSK1 into YL
clr r29              ; high byte (YH) = 0
ld  r16, Y           ; read the current value of PCMSK1
sbr r16, 0b00010000  ; set bit 4
st  Y, r16           ; write it back
Now, the appropriate interrupt vectors must be set, as in the table shown earlier, and interrupts enabled globally by setting the
interrupt (I) flag in the status register (SREG).
This latter operation is performed with the instruction sei.
The interrupt vector table should look as follows:
.org 0
vects:
    jmp RESET
    jmp EXT_INT0
    jmp EXT_INT1
    jmp PCINT0
    jmp PCINT1
    jmp PCINT2
The next thing necessary is to set the stack pointer to a high memory
address, since interrupts push values on the stack:
ldi r16, 0xff
#include <m168def.inc>
.org 0
VECTS:
    jmp RESET       ; vector for reset
    jmp EXT_INT0    ; vector for int0
    jmp EXT_INT1    ; vector for int1
    jmp PCINT_0     ; vector for pcint0
    jmp BUTTON      ; vector for pcint1
    jmp PCINT_2     ; vector for pcint2
    jmp WDT         ; vector for watchdog timer

EXT_INT0:
EXT_INT1:
PCINT_0:
PCINT_2:
WDT:
    reti            ; unused interrupt sources simply return

RESET:
    ldi r16, 0xff
    out SPL, r16    ; stack pointer low byte
    ldi r16, 0x04
    out SPH, r16    ; stack pointer high byte (stack at top of SRAM, 0x04ff)
    sbi EIMSK, 7    ; enable the interrupt source
    sbi EIFR, 7     ; clear its pending flag (writing a 1 clears it)
185
    sei             ; enable interrupts globally
    rjmp LOOP       ; enter the main loop

BUTTON:             ; pcint1 (button) handler: nothing to do here;
    reti            ; the interrupt itself is enough to wake the processor

SNOOZE:
    sleep           ; sleep until an interrupt occurs
LOOP:
    sbic PINB, 4    ; skip the next instruction if PB4 reads 0 (switch pressed)
    rjmp SNOOZE     ; not pressed: go back to sleep
    cbi PORTB, 5    ; drive PB5 low
    ldi R16, 128
SPIN1:
    subi R16, 1     ; crude delay loop
    brne SPIN1
    sbi PORTB, 5    ; drive PB5 high
    ldi R16, 128
SPIN2:
    subi R16, 1
    brne SPIN2
    rjmp LOOP
186
Input-Output Architecture
In our discussion of the memory hierarchy, it was implicitly assumed
that memory in the computer system would be fast enough to
match the speed of the processor (at least for the highest levels
in the memory hierarchy), and that no special consideration need be
given to how long it would take for a word to be transferred from
memory to the processor: an address would be generated by the
processor, and after some fixed time interval, the memory system
would provide the required information. (In the case of a cache miss,
the time interval would be longer, but generally still fixed. For a
page fault, the processor would be interrupted and the page fault
handling software invoked.)
Although input-output devices are mapped to appear like memory
devices in many computer systems, I/O devices have characteristics
quite different from memory devices, and often pose special problems
for computer systems. This is principally for two reasons:
I/O devices span a wide range of speeds (e.g., terminals accepting input at a few characters per second; disks reading data at
over 10 million characters per second).
Unlike memory operations, I/O operations and the CPU are not
generally synchronized with each other.
187
I/O devices also have other characteristics; for example, the amount
of data required for a particular operation. For example, a keyboard
inputs a single character at a time, while a color display may use
several Mbytes of data at a time.
The following table lists several I/O devices and some of their typical properties:

Device           Data per operation   Data rate (Kbytes/s)   Partner
keyboard         0.001 Kbytes         0.01                   human/machine
mouse            0.001 Kbytes         0.1                    human/machine
voice input      --                   --                     human/machine
laser printer    1 to 1000+ Kbytes    1000                   machine/human
color display    --                   100,000+               machine/human
magnetic disk    4 to 4000 Kbytes     100,000+               system
CD/DVD           --                   1000                   system
LAN              --                   100,000+               system/system
188
The following figure shows the general I/O structure associated with
many medium-scale processors. Note that the I/O controllers and
main memory are connected to the main system bus. The cache
memory (usually found on-chip with the CPU) has a direct connection to the processor, as well as to the system bus.
[Figure: the CPU and its cache connect to each other and to the
system bus; main memory and the I/O controllers also sit on the
system bus; the I/O devices attach to the I/O controllers, and
interrupt and control lines run between the I/O controllers and
the CPU]
Note that the I/O devices shown here are not connected directly
to the system bus; they interface with another device called an I/O
controller.
189
In simpler systems, the CPU may also serve as the I/O controller,
but in systems where throughput and performance are important,
I/O operations are generally handled outside the processor.
In higher performance processors (desktop and workstation systems)
there may be several separate I/O buses. The PC today has separate
buses for memory (the FSB, or front-side bus), for graphics (the AGP
bus or PCIe x16 bus), and for I/O devices (the PCI or PCIe bus).
It has one or more high-speed serial ports (USB or Firewire), and
100 Mbit/s or 1 Gbit/s network ports as well. (The PCIe bus is also
serial.)
It may also support several legacy I/O systems, including serial
(RS-232) and parallel (printer) ports.
Until relatively recently, the I/O performance of a system was somewhat of an afterthought for systems designers. The reduced cost of
high-performance disks, permitting the proliferation of virtual memory systems, and the dramatic reduction in the cost of high-quality
video display devices, have meant that designers must pay much
more attention to this aspect to ensure adequate performance in the
overall system.
Because of the different speeds and data requirements of I/O devices,
different I/O strategies may be useful, depending on the type of I/O
device which is connected to the computer. We will look at several
different I/O strategies later.
190
Note that each of the talker and listener supplies two signals. The
talker supplies a signal (say, data valid, or DAV) at step (1), and
another signal (data not valid) at step (3). Both of these can be
coded as a single binary value (DAV) which takes the value 1 at
step (1) and 0 at step (3). The listener supplies a signal (say, data
accepted, or DAC) at step (2), and a signal (data not accepted) at
step (4). It, too, can be coded as a single binary variable, DAC.
Because only two binary variables are required, the handshaking
information can be communicated over two wires, and the form of
handshaking described above is called a two-wire handshake.
The following figure shows a timing diagram for the signals DAV
and DAC which illustrates the timing of these four events:
[Timing diagram: DAV rises to 1 at (1) and falls to 0 at (3);
DAC rises to 1 at (2) and falls to 0 at (4)]
192
As stated earlier, either the CPU or the I/O device can act as the
talker or the listener. In fact, the CPU may act as a talker at one
time and a listener at another. For example, when communicating
with a terminal screen (an output device) the CPU acts as a talker,
but when communicating with a terminal keyboard (an input device)
the CPU acts as a listener.
This is about the simplest synchronization which can guarantee reliable communication between two devices. It may be inadequate
where there are more than two devices.
Other forms of handshaking are used in more complex situations; for
example, where there may be more than one controller on the bus,
or where the communication is among several devices.
For example, there is also a similar, but more complex, 3-wire handshake which is useful for communicating among more than two devices.
193
194
Program-controlled I/O
One common I/O strategy is program-controlled I/O (often called
polled I/O). Here, all I/O is performed under the control of an I/O handling procedure, and input or output is initiated by this procedure.
The I/O handling procedure will require some status information
(handshaking information) from the I/O device (e.g., whether the
device is ready to receive data). This information is usually obtained
through a second input from the device; a single bit is usually sufficient, so one input port can be used to collect status, or handshake,
information from several I/O devices. (A port is the name given to a
connection to an I/O device; e.g., to the memory location into which
an I/O device is mapped). An I/O port is usually implemented as
a register (possibly a set of D flip flops) which also acts as a buffer
between the CPU and the actual I/O device. The word port is often
used to refer to the buffer itself.
Typically, there will be several I/O devices connected to the processor; the processor checks the status input port periodically, under
program control by the I/O handling procedure. If an I/O device
requires service, it will signal this need by altering its input to the
status port. When the I/O control program detects that this has
occurred (by reading the status port) then the appropriate operation
will be performed on the I/O device which requested the service.
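As a rough sketch in C, polling might look as follows (the port
addresses and the service() routine are invented for illustration;
a memory-mapped status port is assumed):

#include <stdint.h>

extern void service(int device, uint8_t value);      /* hypothetical handler */

#define STATUS_PORT  (*(volatile uint8_t *)0x0040)   /* assumed addresses */
#define DATA_PORT(i) (*(volatile uint8_t *)(0x0041 + (i)))

void poll_loop(void)
{
    for (;;) {
        uint8_t status = STATUS_PORT;        /* one handshake bit per device */
        for (int i = 0; i < 8; i++)
            if (status & (1u << i))          /* device i requests service */
                service(i, DATA_PORT(i));    /* read its port and handle it */
        /* other work can be done here before the next poll */
    }
}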
195
[Figure: program-controlled I/O: each of devices 1 to N connects to
its own input port, and the devices' handshake (status) lines are
collected into a common status port which the processor polls]
197
#include <m168def.inc>
.org 0
; define interrupt vectors
vects:
    rjmp reset
reset:
    ldi R16, 0b00100000
    out DDRB, R16       ; PB5 as output
    ser R16             ; R16 = 0xff
    out PORTB, R16      ; PB5 high, pull-ups enabled on the inputs
200
LOOP:
    sbic PINB, 4        ; skip the next instruction if PB4 reads 0 (switch pressed)
    rjmp LOOP           ; not pressed: repeat the test
    cbi PORTB, 5        ; drive PB5 low
SPIN1:
    subi R16, 1         ; crude delay: count R16 down to 0
    brne SPIN1
    sbi PORTB, 5        ; drive PB5 high
SPIN2:
    subi R16, 1         ; R16 wraps around, giving 256 more iterations
    brne SPIN2
    rjmp LOOP
201
#include <avr/io.h>
#include <util/delay_basic.h>

int main(void)
{
    DDRB = 0B00100000;                    /* PB5 as output */
    PORTB = 0B11111111;                   /* PB5 high, pull-ups on the inputs */
    while (1) {
        while (!(PINB & 0B00010000)) {    /* while the switch (PB4) reads low */
            PORTB = 0B00100000;           /* PB5 high */
            _delay_loop_1(128);
            PORTB = 0;                    /* PB5 low */
            _delay_loop_1(128);
        }
    }
    return(1);
}
Two words about mechanical switches: they bounce! That is, they
make and break contact several times in the few milliseconds before
full contact is made or broken. This means that a single switch
operation may be seen as several switch actions.
The way this is normally handled is to read the value at the switch (in
a loop) several times over a short period, and report a stable value.
202
Interrupt-controlled I/O
Interrupt-controlled I/O reduces the severity of the two problems
mentioned for program-controlled I/O by allowing the I/O device
itself to initiate the device service routine in the processor. This is
accomplished by having the I/O device generate an interrupt signal
which is tested directly by the hardware of the CPU. When the interrupt input to the CPU is found to be active, the CPU itself initiates
a subprogram call to somewhere in the memory of the processor; the
particular address to which the processor branches on an interrupt
depends on the interrupt facilities available in the processor.
The simplest type of interrupt facility is where the processor executes
a subprogram branch to some specific address whenever an interrupt
input is detected by the CPU. The return address (the location of
the next instruction in the program that was interrupted) is saved
by the processor as part of the interrupt process.
If there are several devices which are capable of interrupting the processor, then with this simple interrupt scheme the interrupt handling
routine must examine each device to determine which one caused
the interrupt. Also, since only one interrupt can be handled at a
time, there is usually a hardware priority encoder which allows the
device with the highest priority to interrupt the processor, if several
devices attempt to interrupt the processor simultaneously.
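With this simple scheme, the single handler has to find the requester
itself; a sketch (reusing the assumed status and data ports from the
polling example earlier):

/* entered (via the single interrupt vector) whenever the interrupt
   input is active; scans the status port for the requesting device */
void interrupt_handler(void)
{
    uint8_t status = STATUS_PORT;
    for (int i = 0; i < 8; i++) {        /* lower i = higher priority here */
        if (status & (1u << i)) {
            service(i, DATA_PORT(i));
            break;                       /* one device serviced per interrupt */
        }
    }
}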
203
In the previous figure, the handshake out outputs would be connected to a priority encoder to implement this type of I/O. The other
connections remain the same. (Some systems use a daisy chain
priority system to determine which of the interrupting devices is serviced first. Daisy chain priority resolution is discussed later.)
[Figure: interrupt-controlled I/O: devices 1 to N connect to their
ports as before, but the handshake lines are routed through a
priority encoder to the processor's interrupt input]
205
Vectored interrupts
Many computers make use of vectored interrupts. With vectored
interrupts, it is the responsibility of the interrupting device to provide the address in main memory of the interrupt servicing routine
for that device. This means, of course, that the I/O device itself
must have sufficient intelligence to provide this address when requested by the CPU, and also to be initially programmed with
this address information by the processor. Although somewhat more
complex than the simple interrupt system described earlier, vectored
interrupts provide such a significant advantage in interrupt handling
speed and ease of implementation (i.e., a separate routine for each
device) that this method is almost universally used on modern computer systems.
Some processors have a number of special inputs for vectored interrupts (each acting much like the simple interrupt described earlier).
Others require that the interrupting device itself provide the interrupt address as part of the process of interrupting the processor.
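The effect of vectoring can be sketched in C as a table of handler
addresses, indexed by whatever the device supplies (the names are
invented for illustration):

typedef void (*isr_t)(void);

static isr_t vector_table[16];            /* one entry per interrupt source */

void register_isr(unsigned vec, isr_t handler)   /* done at initialization */
{
    vector_table[vec] = handler;
}

void dispatch(unsigned vec)   /* vec is supplied by the interrupting device */
{
    if (vector_table[vec])
        vector_table[vec]();  /* branch directly to that device's routine */
}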
206
Address   Event
0x000     RESET     (power on or reset)
0x002     INT0
0x004     INT1
0x006     PCINT0
0x008     PCINT1
0x00A     PCINT2
0x00C     WDT       (watchdog timer)
207
To enable the interrupt, the appropriate bit of the external interrupt
mask register (EIMSK) is set:
sbi EIMSK, 7
There is also a register associated with the particular pins for the
PCINT interrupts. They are the Pin Change Mask Registers (PCMSK1
and PCMSK0).
We want to enable the input connected to the switch, at PINB[4],
which is pcint12, and therefore set bit 4 of PCMSK1, leaving the
other bits unchanged.
Normally, this would be possible with the code
sbi PCMSK1, 4
Unfortunately, this register is one of the extended I/O registers, and
must be written to as a memory location.
The following code sets up its address in register pair Y, reads the
current value in PCMSK1, sets bit 4 to 1, and rewrites the value in
memory.
ldi r28, PCMSK1      ; low byte of the address of PCMSK1 into YL
clr r29              ; high byte (YH) = 0
ld  r16, Y           ; read the current value of PCMSK1
sbr r16, 0b00010000  ; set bit 4
st  Y, r16           ; write it back
Now, the appropriate interrupt vectors must be set, as in the table shown earlier, and interrupts enabled globally by setting the
interrupt (I) flag in the status register (SREG).
This latter operation is performed with the instruction sei.
The interrupt vector table should look as follows:
.org 0
vects:
    jmp RESET
    jmp EXT_INT0
    jmp PCINT0
    jmp PCINT1
    jmp TIM2_COMP
The next thing necessary is to set the stack pointer to a high memory
address, since interrupts push values on the stack:
ldi r16, 0xff
211
#include <m168def.inc>
.org 0
VECTS:
    jmp RESET       ; vector for reset
    jmp EXT_INT0    ; vector for int0
    jmp EXT_INT1    ; vector for int1
    jmp PCINT_0     ; vector for pcint0
    jmp BUTTON      ; vector for pcint1
    jmp PCINT_2     ; vector for pcint2

EXT_INT0:
EXT_INT1:
PCINT_0:
PCINT_2:
    reti            ; unused interrupt sources simply return

RESET:
    ldi r16, 0xff
    out SPL, r16    ; stack pointer low byte
    ldi r16, 0x04
    out SPH, r16    ; stack pointer high byte (stack at top of SRAM, 0x04ff)
    ldi R16, 0b00100000
    out DDRB, r16   ; PB5 as output
    ser R16
    out PORTB, r16  ; PB5 high, pull-ups enabled on the inputs
    sbi EIMSK, 7    ; enable the interrupt source
    sbi EIFR, 7     ; clear its pending flag (writing a 1 clears it)
212
    sei             ; enable interrupts globally
    rjmp LOOP       ; enter the main loop

BUTTON:             ; pcint1 (button) handler: nothing to do here;
    reti            ; the interrupt itself is enough to wake the processor

SNOOZE:
    sleep           ; sleep until an interrupt occurs
LOOP:
    sbic PINB, 4    ; skip the next instruction if PB4 reads 0 (switch pressed)
    rjmp SNOOZE     ; not pressed: go back to sleep
    cbi PORTB, 5    ; drive PB5 low
    ldi R16, 128
SPIN1:
    subi R16, 1     ; crude delay loop
    brne SPIN1
    sbi PORTB, 5    ; drive PB5 high
    ldi R16, 128
SPIN2:
    subi R16, 1
    brne SPIN2
    rjmp LOOP
213
Direct memory access (DMA)
A third strategy is direct memory access, where a DMA controller
transfers blocks of data between an I/O device and main memory
directly, without the processor handling each word.
There are two possibilities for the timing of the data transfer from
the DMA controller to memory:
The controller can cause the processor to halt if it attempts to
access data in the same bank of memory into which the controller
is writing. This is the fastest option for the I/O device, but may
cause the processor to run more slowly because the processor
may have to wait until a full block of data is transferred.
The controller can use memory cycles which are not otherwise
used in the particular bank of memory into which the DMA
controller is writing data. This approach, called cycle stealing,
is perhaps the most commonly used approach. (In a processor
with a cache that has a high hit rate this approach may not slow
the I/O transfer significantly).
DMA is a sensible approach for devices which have the capability of
transferring blocks of data at a very high data rate, in short bursts. It
is not worthwhile for slow devices, or for devices which do not provide
the processor with large quantities of data. Because the controller for
a DMA device is quite sophisticated, the DMA devices themselves
are usually quite sophisticated (and expensive) compared to other
types of I/O devices.
215
[Figure: daisy-chained arbitration: the grant signal from the bus
master passes through device 1, device 2, ..., device n in series, and
the devices share a common request line]
217
[Figure: independent request/grant arbitration: each of device 1 to
device n has its own request and grant lines connected to a central
bus arbiter]
218
Distributed arbitration by self-selection: here, the devices themselves determine which of them has the highest priority. Each
device has a bus request line or lines on which it places a code
identifying itself. Each device examines the codes for all the requesting devices, and determines whether or not it is the highest
priority requesting device.
These arbitration schemes may also be used in conjunction with each
other. For example, a set of similar devices may be daisy chained
together, and this set may be an input to a priority encoded scheme.
There is one other arbitration scheme for serial buses: distributed
arbitration by collision detection. This is the method used by
Ethernet, and it will be discussed later.
219
Some problems can arise with memory-mapped I/O in systems which
use cache memory or virtual memory. If a processor uses virtual
memory mapping, and the I/O ports are allowed to be in a virtual
address space, the mapping to the physical device may not be consistent if there is a context switch, or even if a page is replaced.
If physical addressing is used, mapping across page boundaries may
be problematic.
In many operating systems, I/O devices are directly addressable only
by the operating system, and are assigned to physical memory locations which are not mapped by the virtual memory system.
If the memory locations corresponding to I/O devices are cached,
then the value in cache may not be consistent with the new value
loaded in memory. Generally, either there is some method for invalidating cache that may be mapped to I/O addresses, or the I/O
addresses are not cached at all. We will look at the general problem of maintaining cache in a consistent state (the cache coherency
problem) in more detail when we discuss multi-processor systems.
221
[Chart: microcontroller and DSP sales by year, 1991 to 1996, in
billions of units, broken down into 4-bit, 8-bit, 16/32-bit, and DSP
devices; 8-bit parts dominate, with values ranging from 4.9 to 11.7
billion]
The projected microcontroller sales (SIA projection) were 9.8 billion
units for 2001; 9.6 billion for 2002; 12.0 billion for 2003; 13.0 billion
for 2004; and 14 billion for 2005.
For DSP devices, the projections were 4.9 billion in 2002, 6.5 billion
in 2003, 8.4 billion in 2004, and 9.4 billion in 2005.
223
Magnetic disks
A magnetic disk drive consists of a set of very flat disks, called platters, coated on both sides with a material which can be magnetized
or demagnetized.
The magnetic state can be read or written by small magnetic heads
located on mechanical arms which can move in and out over the
surfaces of the disks, very close to, but not actually touching, the
surfaces.
224
[Figure: disk structure: platters, concentric tracks on each platter
surface, and sectors within each track]
Total storage is
(no. of platters) × (no. of tracks/platter) × (no. of sectors/track) × (no. of bytes/sector)
Typically, disks are formatted and bad sectors are noted in a table
stored in the controller.
225
Disks spin at speeds of 4200 RPM to 15,000 RPM. Typical speeds for
PC desktops are 7200 RPM and 10,000 RPM. Laptop disks usually
spin at 4200 or 5400 RPM.
Disk speed is usually characterized by several parameters:
average seek time, the average time required for the read/write
head to be positioned over the correct track; typically about 8 ms.
rotational latency, the average time for the appropriate sector
to rotate to a point under the head (4.17 ms for a 7200 RPM
disk), and
transfer rate; this is about 5 Mbytes/s for an early IDE drive
(33 to 133 MB/s for an ATA drive, and 150 to 300 MB/s for a
SATA drive). Typically, sustained rates are less than half the
maximum rates.
controller overhead also contributes some delay; typically 1 ms.
Assuming a sustained data transfer rate of 50 MB/s, the time required
to transfer a 1 Kbyte block is
8 ms + 4.17 ms + 0.02 ms + 1 ms = 13.2 ms
To transfer a 1 Mbyte block in the same system, the time required is
8 ms + 4.17 ms + 20 ms + 1 ms = 33.2 ms
Note that for small blocks, most of the time is spent finding the data
to be transferred. This time is the latency of the disk.
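The same arithmetic, as a quick C check (values taken from the
example above):

#include <stdio.h>

int main(void)
{
    double seek = 8.0, latency = 4.17, overhead = 1.0;  /* all in ms */
    double rate = 50000.0;               /* 50 MB/s = 50,000 bytes per ms */

    printf("1 Kbyte: %.2f ms\n", seek + latency + 1.0e3/rate + overhead);
    printf("1 Mbyte: %.2f ms\n", seek + latency + 1.0e6/rate + overhead);
    return 0;   /* prints about 13.19 ms and 33.17 ms */
}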
226
227
[Figures: data bits (1, 0, X, 1, 1) striped across disks, with a parity
bit allowing the missing bit X to be reconstructed]
229
Failure tolerance
RAID levels above 0 provide tolerance to single disk failure. Systems
can actually rebuild a file system after the failure of a single disk.
Multiple disk failure generally results in the corruption of the whole
file system.
RAID level 0 actually makes the system more vulnerable to disk
failure: failure of a single disk can destroy the data in the whole
array.
For example, assume a disk has a failure rate of 1%. The probability
of a failure in a 2 disk system is
0.01 + (1 − 0.01) × 0.01 ≈ 0.02 = 2%
Consider a RAID 3, 4, or 5 system with 4 disks. Here, 2 disks must
fail at the same time for a system failure.
Consider a 4 disk system with the same failure rate. The probability
of a particular pair of disks failing (and the other two not failing) is
(1 − 0.01)² × (0.01)² ≈ 0.0001 = 0.01%
(counting all 6 possible pairs gives about 0.06%)
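A quick check of this failure arithmetic (per-disk failure rate of 1%):

#include <stdio.h>

int main(void)
{
    double p = 0.01;                            /* per-disk failure rate */
    double q = 1.0 - p;
    printf("2-disk array, >= 1 failure:  %.4f\n", p + q*p);     /* 0.0199 */
    printf("given pair of 4 disks fails: %.6f\n", q*q*p*p);     /* 0.000098 */
    printf("any 2 of 4 disks fail:       %.6f\n", 6*q*q*p*p);   /* 0.000588 */
    return 0;
}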
230
[Figure: host stations connected to a shared network]
231
[Figure: a switched network: host stations connected to a central
switch by Cat-5 wire]
The maximum length of a single link is 100 m, and switches are
often linked by optical fibre cable.
The following pictures show the network in the Department:
The first shows the cable plant (where all the Cat-5 wires are connected to the switches).
The second shows the switches connecting the Department to the
campus network. It is an optical fiber network operating at 10 Gbit/s.
The optical fiber cables are orange.
The third shows the actual switches used for the internal network.
Note the orange optical fibre connection to each switch.
235
Multiprocessor systems
In order to perform computation faster, there are two basic strategies:
Increase the speed of individual computations.
Increase the number of operations performed in parallel.
At present, high-end microprocessors are manufactured using aggressive technologies, so there is relatively little opportunity to take the
first strategy, beyond the general trend (Moore's law).
There are a number of ways to pursue the second strategy:
Increase the parallelism within a single processor, using multiple
parallel pipelines and fast access to memory. (e.g., the Cray
computers).
Use multiple commercial processors, each with its own memory
resources, interconnected by some network topology.
Use multiple commercial microprocessors in the same box,
sharing memory and other resources.
The first of these approaches was successful for several decades, but
the low cost per unit of commercial microprocessors is so attractive
that microprocessor-based systems have the potential to provide
very high performance computing at relatively low cost.
240
Multiprocessor systems
A multiprocessor system might look as follows:
[Figure: processors connected through an interconnection network
to memory]
242
[Figure: a bus-based shared-memory multiprocessor: each
processor's cache (with its tags) sits between a local bus and the
global bus, and the shared memory is on the global bus]
Note that here each processor has its own cache. Virtually all current
high performance microprocessors have a reasonable amount of high
speed cache implemented on chip.
In a shared memory system, this is particularly important to reduce
contention for memory access.
243
The Intel Pentium class processors use a similar cache protocol, called
the MESI protocol. Most other single-chip multiprocessors use this,
or a very similar protocol, as well.
The MESI protocol has 4 states:
modified: the cache line has been modified, and is available only
in this cache (dirty)
exclusive: no other caches have this line, and it is consistent
with memory (reserved)
shared: the line may also be in other caches, and memory is up-to-date
(clean)
invalid: the cache line is not correct (invalid)
                      M       E       S       I
Line valid?           Yes     Yes     Yes     No
Memory copy is...     stale   valid   valid   --
Copies in other
caches?               No      No      Maybe   Maybe
248
Read hit
For a read hit, the processor takes data directly from the local cache
line, as long as the line is valid. (If it is not valid, it is a cache miss,
anyway.)
Read miss
Here, there are several possibilities:
If no other cache has the line, the data is taken from memory
and marked exclusive.
If one or more caches have a clean copy of this line, (either in
states exclusive or shared) they should signal the cache, and
each cache should mark its copy as shared.
If a cache has a modified copy of this line, it signals the requesting cache
to retry, writes its copy back to memory immediately, and marks its
own copy as shared. (The requesting cache will then read the
data from memory, and mark it as shared.)
249
Write hit
The processor marks the line in cache as modified. If the line was
already in state modified or exclusive, then that cache has the only
copy of the data, and nothing else need be done. If the line was in
state shared, then the other caches should mark their copies invalid.
(A bus transaction is required).
Write miss
The processor first reads the line from memory, then writes the word
to the cache, marks the line as modified, and performs a bus transaction so that if any other cache has the line in the shared or exclusive
state it can be marked invalid.
If, on the initial read, another cache has the line in the modified
state, that cache marks its own copy invalid, suspends the initiating
read, and immediately writes its value to memory. The suspended
read resumes, getting the correct value from memory. The word can
then be written to this cache line, and marked as modified.
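The transitions just described can be condensed into a small sketch
(local events only; the bus signalling and write-backs are elided):

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* a read hit leaves the state alone; a read miss loads the line and
   marks it exclusive or shared depending on the other caches */
mesi_t on_read(mesi_t s, int line_in_other_caches)
{
    if (s != INVALID)
        return s;                                      /* read hit */
    return line_in_other_caches ? SHARED : EXCLUSIVE;  /* read miss */
}

/* any local write ends in modified; from shared this requires an
   invalidating bus transaction first, and from invalid (a write
   miss) a memory read first */
mesi_t on_write(mesi_t s)
{
    return MODIFIED;
}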
250
False sharing
One type of problem possible in multiprocessor cache systems using
a write-invalidate protocol is false sharing.
This occurs with line sizes greater than a single word, when one processor writes to a line that is stored in the cache of another processor.
Even if the processors do not share a variable, the fact that an entry
in the shared line has changed forces the caches to treat the line as
shared.
It is instructive to consider the following example (assume a line size
of 4 32 bit words, and that all caches initially contain clean, valid
data):
Step   Processor   Action   Address
1      P1          write    100
2      P2          write    104
3      P1          read     100
4      P2          read     104
Note that addresses 100 and 104 are in the same cache line (the line
is 4 words or 16 bytes, and the addresses are in bytes).
Consider the MESI protocol, and determine what happens at each
step.
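A POSIX-threads sketch of the effect: the two threads never share a
variable, only a cache line (the counts are illustrative):

#include <pthread.h>
#include <stdio.h>

struct { long a; long b; } c;   /* a and b will normally share a cache line */

static void *bump_a(void *arg) { for (long i = 0; i < 100000000; i++) c.a++; return NULL; }
static void *bump_b(void *arg) { for (long i = 0; i < 100000000; i++) c.b++; return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* padding a and b onto separate lines (say, a char pad[56] between
       them) removes the line ping-ponging and typically runs much faster */
    printf("%ld %ld\n", c.a, c.b);
    return 0;
}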
251
tmp = 0;
for (i = 0; i < n; i = i + 1)   /* n is the size of this processor's subset */
    tmp = tmp + A[Pn*n + i];
sum[Pn] = tmp;
This loop uses load instructions to bring the correct subset of numbers to the caches of each processor from the common main memory.
Each processor must have its own version of the loop counter variable
i, so it must be a private variable. Similarly for the partial sum,
tmp. The array sum[Pn] is a global array of partial sums, one from
each processor.
252
The next step is to add these many partial sums, using divide and
conquer. Half of the processors add pairs of partial sums, then a
quarter add pairs of the new partial sums, and so on until we have
the single, final sum.
In this example, the two processors must synchronize before the
consumer processor tries to read the result written to memory
by the producer processor; otherwise, the consumer may read the
old value of the data. Following is the code (half is private also):
half = 16;    /* 16 processors in multiprocessor */
repeat
    synch();                    /* wait for partial sum completion */
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];  /* if half is odd, P0 adds the odd element */
    half = half/2;              /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
253
Mutual exclusion is usually built from an atomic test-and-set (or
swap) operation on a lock variable, with 0 meaning unlocked and 1
meaning locked; a processor wanting the lock first spin waits, reading
the lock until it sees a 0.
The processor then races against all other processors that were similarly spin waiting to see who can lock the variable first. All processors
use a test-and-set instruction that reads the old value and stores a
1 (locked) into the lock variable. The single winner will see the 0
(unlocked), and the losers will see a 1 that was placed there by the
winner. (The losers will continue to write the variable with the locked
value of 1, but that doesn't change its value.) The winning processor
then executes the code that updates the shared data. When the winner exits, it stores a 0 (unlocked) into the lock variable, thereby
starting the race all over again.
The term usually used to describe the code segment between the lock
and the unlock is a critical section.
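A minimal spin-lock sketch in C, using a test-and-set primitive (here
the GCC/Clang __sync_lock_test_and_set builtin):

static volatile int lock = 0;              /* 0 = unlocked, 1 = locked */

void acquire(void)
{
    for (;;) {
        while (lock != 0)                  /* spin wait: read until unlocked */
            ;
        if (__sync_lock_test_and_set(&lock, 1) == 0)
            return;                        /* winner: saw the old value 0 */
        /* loser: saw a 1 stored by the winner; spin again */
    }
}

void release(void)                         /* end of the critical section */
{
    __sync_lock_release(&lock);            /* store 0: restart the race */
}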
256
[Flowchart: load the lock variable S; if S is not 0 (locked), spin by
loading it again; if S = 0 (unlocked), compete for the lock with an
atomic test-and-set; if the attempt fails, spin again; if it succeeds,
access the shared resource (the critical section), then unlock by
setting S = 0]
257
Let us see how this spin lock scheme works with bus-based cache
coherency.
One advantage of this scheme is that it allows processors to spin wait
on a local copy of the lock in their caches. This dramatically reduces
the amount of bus traffic. The following table shows the bus and
cache operations for multiple processors trying to lock a variable.
Once the processor with the lock stores a 0 into the lock, all other
caches see that store and invalidate their copy of the lock variable.
Then they try to get the new value of 0 for the lock. (With write
update cache coherency, the caches would update their copy rather
than first invalidate and then load from memory.) This new value
starts the race to see who can set the lock first. The winner gets the
bus and stores a 1 into the lock; the other caches replace their copy
of the lock variable containing 0 with a 1.
They read that the variable is already locked and must return to
testing and spinning.
Because of the communication traffic generated when the lock is
released, this scheme has difficulty scaling up to many processors.
258
[Table: bus and cache activity as P0 (which holds the lock) and the
spinning P1 and P2 contend: P0 releases the lock by writing 0, which
write-invalidates the spinners' cached copies; P1 and P2 miss, and the
bus services one cache miss at a time; both read lock = 0 and race to
set it; the winner's test-and-set write-invalidates the other copies; the
loser reads lock = 1 and returns to spinning, while the winner
proceeds to the shared data]
259
[Figure: a message-passing multiprocessor: processors, each with its
own memory, connected by an interconnection network]
260
The following diagrams show some useful network topologies. Typically, a topology is chosen which maps onto features of the program
or data structures.
Ring
1D mesh
2D mesh
2D torus
Tree
3D grid
Hypercube
Butterfly
262
[Figure: eight processors connected through a single switch]
263
264
The next step is to get the sum of each subset. This step is simply
a loop that every execution unit follows: read a word from local
memory and add it to a local variable:
receive(A1);    /* get this unit's subset of the data */
sum = 0;
for (i = 0; i < 1000000; i = i + 1)
    sum = sum + A1[i];
Again, the final step is adding these 16 partial sums. Now, each
partial sum is located in a different execution unit. Hence, we must
use the interconnection network to send partial sums to accumulate
the final sum.
Rather than sending all the partial sums to a single processor, which
would result in sequentially adding the partial sums, we again apply
divide and conquer. First, half of the execution units send their
partial sums to the other half of the execution units, where two partial
sums are added together. Then one quarter of the execution units
(half of the half) send this new partial sum to the other quarter of the
execution units (the remaining half of the half) for the next round of
sums.
265
limit = 16; half = 16;    /* 16 processors */
repeat
    half = (half+1)/2;              /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);       /* upper half sends its partial sum */
    if (Pn <= (limit/2-1)) {
        receive(tmp);
        sum = sum + tmp;            /* lower half receives and accumulates */
    }
    limit = half;                   /* upper limit of the senders */
until (half == 1);                  /* exit with final sum */
This code divides all processors into senders or receivers and each
receiving processor gets only one message, so we can presume that a
receiving processor will stall until it receives a message. Thus, send
and receive can be used as primitives for synchronization as well as
for communication, as the processors are aware of the transmission
of data.
266
Amdahl's law: if P is the fraction of a computation that can be
executed in parallel, and N is the number of processors, the overall
speedup is limited to
1 / ((1 − P) + P/N)
267
Gustafson's law
One of the advantages of increasing the amount of computing power available for a problem is that problems of a larger size can be attempted.
So, rather than keeping the problem size fixed, suppose we can formulate the problem to try to use parallelism to solve a larger problem
in the same amount of time. (Gustafson called this scaled speedup.)
The idea here is that, for certain problems, the serial part uses a nearly
constant amount of time, while the parallel part can scale with the
number of processors.
Assume that a problem has a serial component s and a parallel component p, so the time on the N-processor machine is s + p.
The time to complete the equivalent computation on a single processor is then s + N×p.
The speedup S(N) is
(single processor time)/(N processor time) = (s + N×p)/(s + p)
Letting α be the sequential fraction of the parallel execution time,
α = s/(s + p)
then S(N) = α + N×(1 − α)
If α is small, then S(N) ≈ N.
For problems fitting this model, the speedup is really the best one
can hope from applying N processors to a problem.
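The two formulas, side by side in C (the fractions and processor
count are arbitrary illustrative values):

#include <stdio.h>

int main(void)
{
    double P = 0.95;       /* parallel fraction of the fixed-size problem */
    double alpha = 0.05;   /* serial fraction of the parallel execution time */
    int N = 16;

    printf("Amdahl:    %.2f\n", 1.0 / ((1.0 - P) + P / N));  /* about 9.1 */
    printf("Gustafson: %.2f\n", alpha + N * (1.0 - alpha));  /* 15.25 */
    return 0;
}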
268
So, we have two models for analyzing the potential speedup for parallel computation.
They differ in the way they determine speedup.
Let us think of a simple example to show the difference between the
two:
Consider booting a computer system. It may be possible to reduce
the time required somewhat by running several processes simultaneously, but the serial nature of the process will place a lower limit on
the amount of time required. (Amdahl's Law.)
Gustafson's Law would say that, in the same time that is required
to boot the processor, more facilities could be made available; for
example, initiating more advanced window managers, or bringing up
peripheral devices.
A common explanation of the difference is:
Suppose a car is traveling between two cities 90 km apart. If the
car travels at 60 km/h for the first hour, then it can never average
100 km/h between the two cities. (Amdahl's Law.)
Suppose a car is traveling at 60 km/h for the first hour. If it
is then driven consistently at a speed greater than 100 km/h, it will
eventually average 100 km/h. (Gustafson's Law.)
269