A pipelined implementation
[Figure: the pipelined datapath: the single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, and the associated adders and MUXes) divided into the IF, ID, EX, MEM, and WB stages by pipeline registers.]
It is useful to note the changes that have been made to the datapath.
The most obvious change is, of course, the addition of the pipeline registers.
The addition of these registers introduces some questions:
How large should the pipeline registers be?
Will they be the same size in each stage?
The next change is the location of the MUX that updates the PC.
This must be associated with the IF stage, where the PC is also incremented.
The third change is to preserve the address of the register to be written in the register file. This is done by passing the address along the pipeline registers until it is required in the WB stage.
The write address supplied by the MUX now comes from the pipeline register, rather than directly from the instruction.
Pipeline control
Since five instructions are now executing simultaneously, the controller for the pipelined implementation is, in general, more complex.
It is not as complex as it appears at first glance, however.
For a processor like the MIPS, it is possible to decode the instruction
in the early pipeline stages, and to pass the control signals along the
pipeline in the same way as the data elements are passed through
the pipeline.
(This is what will be done in our implementation.)
A variant of this would be to pass the instruction field (or parts of
it) and to decode the instruction as needed for each stage.
For our processor example, since the datapath elements are the same as for the single cycle processor, the control signals required are similar, and can be implemented in a similar way.
All the signals can be generated early (in the ID stage) and passed
along the pipeline until they are required.
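As a rough sketch of this idea in C (all names are invented for illustration, not any particular implementation), the pipeline registers can be modelled as structures that carry the decoded control bits forward together with the data:

/* A sketch of pipeline registers carrying control signals forward.
   All field names are illustrative only. */
#include <stdint.h>

typedef struct {             /* control bits decoded in the ID stage */
    unsigned reg_dst    : 1; /* EX-stage controls */
    unsigned alu_src    : 1;
    unsigned alu_op     : 2;
    unsigned branch     : 1; /* MEM-stage controls */
    unsigned mem_read   : 1;
    unsigned mem_write  : 1;
    unsigned reg_write  : 1; /* WB-stage controls */
    unsigned mem_to_reg : 1;
} Control;

typedef struct {             /* ID/EX pipeline register */
    Control  ctrl;           /* all remaining control bits travel here */
    uint32_t read_data1, read_data2, sign_ext_imm;
    uint8_t  rt, rd;         /* candidate write-register addresses */
} ID_EX;

typedef struct {             /* EX/MEM: the EX controls are no longer needed */
    Control  ctrl;
    uint32_t alu_result, write_data, branch_target;
    uint8_t  write_reg;      /* the chosen write-register address */
} EX_MEM;

/* On each clock edge the control bits are simply copied along,
       ex_mem.ctrl = id_ex.ctrl;
   exactly as the data fields are. */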
[Figure: the pipelined datapath with control: the control signals (PCSrc, RegDst, ALUSrc, ALUop, Branch, MemRead, MemWrite, RegWrite, MemtoReg) are generated from Inst[31-26] in the ID stage and carried through the EX, MEM, and WB fields of the pipeline registers until they are used.]
Executing an instruction
In the following figures, we will follow the execution of an instruction
through the pipeline.
The instructions we have implemented in the datapath are those of
the simplest version of the single cycle processor, namely:
the R-type instructions
load
store
beq
We will follow the load instruction, as an example.
[Figures: the load instruction followed through the pipeline, one clock cycle per figure, through the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers: instruction fetch, register read and decode, ALU address calculation, memory read, and register write back. The active portion of the datapath is highlighted in each figure.]
[Figures: multiple-clock-cycle pipeline diagrams of instruction sequences (lw, sw, add, sub, beq, and), each instruction occupying the IM, REG, ALU, DM, and REG stages in successive cycles.]
Pipeline hazards
There are three types of hazards in pipelined implementations: structural hazards, control hazards, and data hazards.
Structural hazards
Structural hazards occur when there are insufficient hardware resources to support the particular combination of instructions presently
being executed.
The present implementation has a potential structural hazard if there
is a single memory for data and instructions.
Other structural hazards cannot happen in a simple linear pipeline,
but for more complex pipelines they may occur.
Control hazards
These hazards happen when the flow of control changes as a result
of some computation in the pipeline.
One question here is: what happens to the rest of the instructions in the pipeline?
Consider the beq instruction.
The branch address calculation and the comparison are performed in the EX cycle, and the branch address is returned to the PC in the next cycle.
[Figure: pipeline diagram showing the instructions following a branch (an add and a lw) stalled until the branch is resolved.]
It can also be done by the compiler, by placing several nop instructions following a branch. (It is not called a pipeline stall then.)
[Figures: pipeline diagrams of a beq followed by three nops before the next instructions (add, lw), and of the instruction at the branch target entering the pipeline only after the branch is resolved.]
We saw earlier that branches are quite common, so inserting many stalls or nops is inefficient. An alternative is to have the compiler fill the delay slots after a branch with useful instructions.
For long pipelines, however, it is difficult to find useful instructions to fill several branch delay slots, so this idea is not used in most modern processors.
Branch prediction
If branches could be predicted, there would be no need for stalls.
Most modern processors do some form of branch prediction.
Perhaps the simplest is to predict that no branch will be taken.
In this case, the pipeline is flushed if the branch prediction is wrong,
and none of the results of the instructions in the pipeline are written
to the register file.
How effective is this prediction method?
What branches are most common?
Consider the most common control structure in most programs: the loop.
In this structure, the most common result of a branch is that it is taken; consequently the next instruction in memory is a poor prediction. In fact, in a loop, the branch is not taken exactly once, at the end of the loop.
A better choice may be to record the last branch decision (or the last few decisions) and make a prediction based on the branch history.
Branches are problematic in that they are frequent, and cause inefficiencies by requiring pipeline flushes; in deep pipelines, these flushes are expensive.
21
Data hazards
Another common pipeline hazard is a data hazard. Consider the
following instructions:
add $r2, $r1, $r3
add $r5, $r2, $r3
Note that $r2 is written in the first instruction, and read in the
second.
In our pipelined implementation, however, $r2 is not written until
four cycles after the second instruction begins, and therefore three
bubbles or nops would have to be inserted before the correct value
would be read.
[Figure: pipeline diagram of the two add instructions, separated by three nops so that $r2 is written back to the register file before it is read.]
[Figures: multi-cycle pipeline diagrams (including a sw $7, 100($2)) with operands forwarded from the ALU and memory pipeline registers to dependent instructions.]
Note how forwarding eliminates the data hazards in these cases.
Implementing forwarding
Note from the previous examples that there are now two potential additional sources of operands for the ALU during the EX cycle: the EX/MEM pipeline register, and the MEM/WB pipeline register.
What additional hardware would be required to provide the data from the pipeline stages?
The data to be forwarded could be required by either of the inputs to the ALU, so two MUXes would be required: one for each ALU input.
Each MUX would have three sources of data: the original data from the register file (in pipeline register ID/EX), or the two pipeline registers to be forwarded from.
Looking only at the datapath for R-type operations, the additional hardware would be as follows:
[Figure: the EX-stage datapath for R-type operations with forwarding: the ForwardA and ForwardB MUXes select each ALU input from the ID/EX register data, the EX/MEM ALU result, or the MEM/WB result.]
Forwarding control
Under what conditions does a data hazard (for R-type operations) occur?
It is when a register to be read in the EX cycle is the same register as one targeted to be written, whose value is still held in either the EX/MEM pipeline register or the MEM/WB pipeline register.
These conditions can be expressed as:
1. EX/MEM.RegisterRd = ID/EX.RegisterRs or ID/EX.RegisterRt
2. MEM/WB.RegisterRd = ID/EX.RegisterRs or ID/EX.RegisterRt
Some instructions do not write registers, so the forwarding unit should check whether the register actually will be written. (If it is to be written, the control signal RegWrite, also carried in the pipeline, will be set.)
Also, an instruction may try to write some value to register 0. More importantly, it may try to write a non-zero value there, which should not be forwarded: register 0 is always zero.
Therefore, register 0 should never be forwarded.
MUX control       Source    Explanation
ForwardA/B = 00   ID/EX     Operand comes from the register file (no forwarding)
ForwardA/B = 10   EX/MEM    Operand is forwarded from the previous ALU result
ForwardA/B = 01   MEM/WB    Operand is forwarded from data memory or an earlier result
The conditions for a hazard with a value in the EX/MEM stage are:
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
then ForwardA = 10
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
then ForwardB = 10
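The same conditions can be written out as a small C function. This is only a sketch (the names are invented, and the rule that a match in EX/MEM takes precedence over one in MEM/WB is the usual refinement, not spelled out above):

#include <stdint.h>

typedef struct {
    int     reg_write;   /* RegWrite control bit, carried in the pipeline */
    uint8_t rd;          /* destination register number */
} FwdSrc;

/* Returns the 2-bit MUX control for one ALU input;
   reg is the source register (rs or rt) held in ID/EX. */
static unsigned forward(FwdSrc ex_mem, FwdSrc mem_wb, uint8_t reg)
{
    /* Forward from EX/MEM first: the most recent result wins. */
    if (ex_mem.reg_write && ex_mem.rd != 0 && ex_mem.rd == reg)
        return 2;                       /* binary 10 */
    /* Forward from MEM/WB only if EX/MEM did not already match. */
    if (mem_wb.reg_write && mem_wb.rd != 0 && mem_wb.rd == reg)
        return 1;                       /* binary 01 */
    return 0;                           /* binary 00: register file value */
}
/* ForwardA = forward(ex_mem, mem_wb, id_ex_rs);
   ForwardB = forward(ex_mem, mem_wb, id_ex_rt); */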
The datapath with the forwarding control is shown in the next figure.
[Figure: the datapath with forwarding control: a forwarding unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against the rs and rt fields held in ID/EX, and drives the ForwardA and ForwardB MUXes at the ALU inputs.]
For a datapath with forwarding, the hazards which are fixed by forwarding are not considered hazards any more.
[Figure: multi-cycle pipeline diagram for the sequence lw $2, 100($3); sw $2, 400($3); add $4, $3, $2.]
Here, the data from the load is not ready when the R-type instruction requires it: we have a hazard.
What can be done here?
[Figures: pipeline diagrams showing the load-use hazard resolved by a one-cycle stall, so that the value loaded by lw $2, 100($3) can be forwarded to the dependent add $4, $3, $2.]
In the MIPS processor, however, the branch instructions were implemented to require only two cycles. The instruction following the branch was always executed. (The compiler attempted to place a useful instruction in this branch delay slot, but if it could not, a nop was placed there.)
The original MIPS did not have forwarding, but it is useful to consider the kinds of hazards which could arise with this instruction.
Consider the sequence
add $2, $3, $4
beq $2, $5, 25
[Figure: pipeline diagram of the add followed by the dependent beq.]
[Figure: the pipelined datapath extended with a hazard detection unit, forwarding unit, and exception support: ID.Flush and EX.Flush can turn instructions into nops, the Cause and Except PC registers record the exception, and the exception handler address 40000040 (hex) can be selected into the PC.]
[Figure: a dynamically scheduled pipeline: an instruction fetch and decode unit issues instructions in order to reservation stations; the functional units (two integer units, floating point, load/store) execute out of order; and a commit unit retires results in order.]
Dynamic pipeline scheduling is used in the three most popular processor families in machines today: the Pentium II, III, and IV, the AMD Athlon, and the PowerPC.
[Figure: a modern dynamically scheduled pipeline: the PC, instruction cache, and branch prediction feed an instruction queue; a decode/dispatch unit and register file issue to reservation stations for branch, integer, complex integer, floating point, and load/store units; a commit unit with a reorder buffer retires results in order, with load and store paths to the data cache.]
Speculative execution
      add   $s0, $0, $0      # initialize i to 0
Loop: lw    $t0, 0($s3)
      lw    $t1, 0($s4)
      addi  $s0, $s0, 1      # increment i
      mul   $t0, $t0, $s2    # $t0 = A * X[i]   (operands assumed; A in $s2)
      add   $t1, $t1, $t0    # $t1 = A * X[i] + Y[i]
      sw    $t1, 0($s4)
      addi  $s3, $s3, 4      # advance the X and Y pointers (assumed)
      addi  $s4, $s4, 4
      bne   $s0, $s1, Loop   # loop until i reaches N   ($s1 = N assumed)
This is a fairly direct implementation of the loop, and is not the most efficient code.
For example, the variable i need not be implemented in this code; we could use the array index for one of the vectors instead, and use the final array address (+4) as the termination condition.
Also, this code has numerous data dependencies, some of which may be reduced by reordering the code.
Using this idea, register $s1 would now be set by the compiler to the value of the start of array X (or Y), plus 4(N + 1).
Reordering, and rescheduling the previous code for the MIPS:
Loop: lw    $t0, 0($s3)
      lw    $t1, 0($s4)
      addi  $s3, $s3, 4      # advance the pointers (assumed, as before)
      addi  $s4, $s4, 4
      mul   $t0, $t0, $s2    # $t0 = A * X[i]
      nop                    # wait for the dependency
      nop                    # on $t0
      nop
      add   $t1, $t1, $t0    # as above
      nop
      nop
      bne   $s4, $s1, Loop   # $s1 holds the termination address (assumed)
      sw    $t1, -4($s4)     # in the branch delay slot
Note that variable i is no longer used, the code is somewhat reordered, nop instructions are added to preserve the correct execution
of the code, and the single branch delay slot is used. The dependencies causing hazards relate to registers $t0 and $t1. (This still may
not be the most efficient schedule.)
A total of 5 nop instructions are used, out of 13 instructions.
This is not very efficient!
Loop unrolling
Suppose we rewrite this code to correspond to the following C program:
for (i = 0; i < N; i+=2)
{
Y[i] = A * X[i] + Y[i];
Y[i+1] = A * X[i+1] + Y[i+1];
}
As long as the number of iterations is a multiple of 2, this is equivalent.
This loop is said to be unrolled once. Each iteration of this loop
does the same computation as two of the previous iterations.
A loop can be unrolled multiple times.
Does this save any execution time? If so, how?
Loop: lw    $t0, 0($s3)
      lw    $t2, 4($s3)
      lw    $t1, 0($s4)
      lw    $t3, 4($s4)
      addi  $s3, $s3, 8      # advance the pointers by two words (assumed)
      addi  $s4, $s4, 8
      mul   $t0, $t0, $s2    # $t0 = A * X[i]
      mul   $t2, $t2, $s2    # $t2 = A * X[i+1]
      add   $t1, $t1, $t0    # Y[i]   (operands assumed)
      add   $t2, $t3, $t2    # Y[i+1] (operands assumed)
      nop                    # $t0 dependency
      nop
      sw    $t1, -8($s4)
      bne   $s4, $s1, Loop
      sw    $t2, -4($s4)     # in the branch delay slot
This code requires 15 instruction executions to complete two iterations of the original loop; the original loop required 2 × 13, or 26, instruction executions to do the same computation.
With additional unrolling, all nop instructions could be eliminated.
Loop merging
Consider the following computation using two SAXPY loops:
for (i = 0; i < N; i++)
{
Y[i] = A * X[i] + Y[i];
}
for (i = 0; i < N; i++)
{
Z[i] = A * X[i] + Z[i];
}
Clearly, it is possible to combine both those loops into one, and it would obviously be a bit more efficient (only one branch, rather than two).
for (i = 0; i < N; i++)
{
Y[i] = A * X[i] + Y[i];
Z[i] = A * X[i] + Z[i];
}
In fact, on a pipelined processor this may be much more efficient
than the original. This code segment can achieve the same savings
as a single loop unrolling on the MIPS processor.
# Text section
        .align 2
        .globl main
        .ent   main
main:   subiu $sp, $sp, 32
        sw    $ra, 20($sp)      # operands assumed, from the restores below
        sw    $fp, 16($sp)
        li    $a0, 10
        jal   fact
        la    $a0, $LC          # (la assumed from context)
        jal   printf
        lw    $ra, 20($sp)
        lw    $fp, 16($sp)
        jr    $ra

        .rdata
$LC:
        .ascii "The factorial of 10 is "
Now the factorial function itself, first setting up the function call
stack, then evaluating the function, and finally restoring saved register values and returning:
# factorial function
        .text                   # Text section
fact:   subiu $sp, $sp, 32
        sw    $ra, 20($sp)
        sw    $fp, 16($sp)
        sw    $a0, 0($fp)
        bgtz  $a0, $L2          # Branch if n > 0   (opcode assumed)
        li    $v0, 1            # Return 1
        j     $L1
$L2:                            # do recursion
        subiu $a0, $a0, 1       # subtract 1 from n
        jal   fact
        lw    $v1, 0($fp)       # recover n          (opcode assumed)
        mul   $v0, $v0, $v1     # n * fact(n-1)      (operands assumed)
$L1:                            # result is in $2
        lw    $ra, 20($sp)
        lw    $fp, 16($sp)
        addiu $sp, $sp, 32      # pop stack          (instruction assumed)
        jr    $ra
For this simple example, the data dependency in the recursion relates
to register $v1.
[Figure: state diagram for a two-bit branch predictor: the taken and weakly taken states predict taken, and two successive not-taken outcomes are needed to change the prediction (and vice versa).]
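The state machine above can be sketched in C as a saturating two-bit counter (a minimal illustration, not any particular processor's predictor):

/* 2-bit saturating branch predictor: states 0,1 predict not taken,
   states 2,3 predict taken; two consecutive mispredictions are
   needed to flip the prediction. */
typedef unsigned char Pred2;              /* holds 0..3 */

static int predict(Pred2 s)               /* 1 = predict taken */
{
    return s >= 2;
}

static Pred2 update(Pred2 s, int taken)   /* saturate at 0 and 3 */
{
    if (taken)  return s < 3 ? s + 1 : 3;
    else        return s > 0 ? s - 1 : 0;
}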
Memory
Memory is often the largest single component in a system, and consequently requires some special care in its design. Of course, it is
possible to use the simple register structures we have seen earlier,
but for large blocks of memory these are usually wasteful of chip
area.
For designs which require large amounts of memory, it is typical to use standard memory chips: these are optimized to provide large memory capacity at high speed and low cost.
There are two basic types of memory: static and dynamic. Static memory is typically more expensive and has a much lower capacity, but very high access speed. (This type of memory is often used for high performance cache memories.) The single-transistor dynamic memory is usually the cheapest RAM, with very high capacity, but relatively slow. It also must be refreshed periodically (by reading or writing) to preserve its data. (This type of memory is typically used for the main memory in computer systems.)
The following diagrams show the basic structures of some commonly
used memory cells in random access memories.
Static memory
[Figure: a static memory cell (transistors M1-M8): pass transistors M5 and M6 are gated by the X-enable line, and the shared column data lines are gated through M7 and M8 by the Y-enable line.]
This static memory cell is effectively an RS flip flop, where the input transistors are shared by all memory cells in the same column.
A particular cell is selected by the Y-enable signal, which gates the data onto the column lines, and by the X-enable signal, which opens the pass transistors (M5, M6) to the cell.
60
M9
M10
VDD
X-enable
M5
M1
s
M7
data
s s
s
M6
M3
s
M8
Y-enable
data
61
[Figure: a dynamic memory cell with separate data in and data out lines, and write (W) and refresh (R, P) control signals.]
For refresh, initially R=1, P=1, W=0 and the contents of memory
are stored on the capacitor. R is then set to 0, and W to 1, and the
value is stored back in memory, after being restored in the refresh
circuitry.
[Figure: the one-transistor dynamic memory cell (M5), with refresh and control circuitry on the data in/out line.]
This memory cell is not only dynamic, but a read destroys the contents of the memory (discharges the capacitor), and the value must
be rewritten. The memory state is determined by the charge on the
capacitor, and this charge is detected by a sense amplifier in the
control circuitry. The amount of charge required to store a value
reliably is important in this type of cell.
For the 1-transistor cell, there are several problems: the gate capacitance is too small to store enough charge, and the readout is destructive. (They are often made with a capacitor constructed over the top of the transistor, to save area.) Also, the output signal is usually quite small. (1M bit dynamic RAMs may store a bit using only 50,000 electrons!) This means that the sense amplifier must be carefully designed to be sensitive to small charge differences, as well as to respond quickly to changes.
[Figures: further memory cell circuit variants.]
The following slides show some of the ways in which single transistor memory cells can be reduced in area to provide high storage densities.
(Taken from Advanced Cell Structures for Dynamic RAMs, IEEE Circuits and Devices, V. 5, No. 1, pp. 27-36.)
The first figure shows a transistor beside a simple two plate capacitor, with both the capacitor and the transistor fabricated on the plane of the surface of the silicon substrate:
[Figures: single-transistor DRAM cell structures of progressively smaller area.]
[Figure: a dual-port static memory cell, with separate X- and Y-enable lines and data0/data1 lines for each port.]
[Figure: the memory hierarchy: the CPU with its cache, main memory, and disks attached through a disk controller.]
Cache memory
The cache is a small amount of high-speed memory, usually with a memory cycle time comparable to the time required by the CPU to fetch one instruction. The cache is usually filled from main memory when instructions or data are fetched into the CPU. Often the main memory will supply a wider data word to the cache than the CPU requires, to fill the cache more rapidly. The amount of information which is replaced at one time in the cache is called the line size for the cache. This is normally the width of the data bus between the cache memory and the main memory. A wide line size for the cache means that several instruction or data words are loaded into the cache at one time, providing a kind of prefetching for instructions or data.
Since the cache is small, the effectiveness of the cache relies on the following properties of most programs:
Spatial locality: most programs are highly sequential; the next instruction usually comes from the next memory location. Data is usually structured. Also, several operations are performed on the same data values, or variables.
Temporal locality: short loops are a common program structure, especially for the innermost sets of nested loops. This means that the same small set of instructions is used over and over, as are many of the data elements.
When a cache is used, there must be some way in which the memory
controller determines whether the value currently being addressed in
memory is available from the cache. There are several ways that this
can be accomplished. One possibility is to store both the address and
the value from main memory in the cache, with the address stored in
a type of memory called associative memory or, more descriptively,
content addressable memory.
An associative memory, or content addressable memory, has the
property that when a value is presented to the memory, the address
of the value is returned if the value is stored in the memory, otherwise
an indication that the value is not in the associative memory is returned. All of the comparisons are done simultaneously, so the search
is performed very quickly. This type of memory is very expensive,
because each memory location must have both a comparator and a
storage element. A cache memory can be implemented with a block
of associative memory, together with a block of ordinary memory.
The associative memory holds the address of the data stored in the
cache, and the ordinary memory contains the data at that address.
[Figure: a cache built from an associative memory and an ordinary memory: the input address is compared against every stored address in parallel, and a match selects the corresponding data word in the ordinary memory.]
If the address is not found in the associative memory, then the value
is obtained from main memory.
Associative memory is very expensive, because a comparator is required for every word in the memory, to perform all the comparisons
in parallel.
A cheaper way to implement a cache memory, without using expensive associative memory, is to use direct mapping. Here, part of
the memory address (the low order digits of the address) is used to
address a word in the cache. This part of the address is called the
index. The remaining high-order bits in the address, called the tag,
are stored in the cache memory along with the data.
For example, if a processor has an 18 bit address for memory, and
a cache of 1 K words of 2 bytes (16 bits) length, and the processor
can address single bytes or 2 byte words, we might have the memory
address field and cache organized as follows:
[Figure: the cache organization for this example: memory address bits 17-11 form the tag, bits 10-1 the index, and bit 0 selects the byte in the word; each of the 1024 cache entries holds a tag, two data bytes (byte 0 and byte 1), parity bits, and a valid bit.]
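A sketch in C of the lookup for this organization (the structure and names are invented for illustration):

#include <stdint.h>

#define CACHE_LINES 1024                 /* 10-bit index */

typedef struct {
    unsigned tag   : 7;                  /* address bits 17..11 */
    unsigned valid : 1;
    uint8_t  byte0, byte1;               /* the two data bytes in the line */
} Line;

static Line cache[CACHE_LINES];

/* Returns 1 on a cache hit for the given 18-bit byte address. */
static int lookup(uint32_t addr, uint8_t *out)
{
    uint32_t index = (addr >> 1) & (CACHE_LINES - 1); /* bits 10..1 */
    uint32_t tag   = (addr >> 11) & 0x7F;             /* bits 17..11 */
    Line *l = &cache[index];
    if (l->valid && l->tag == tag) {
        *out = (addr & 1) ? l->byte1 : l->byte0;      /* bit 0: byte in word */
        return 1;
    }
    return 0;                            /* miss: fetch from main memory */
}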
This was, in fact, the way the cache was organized in the PDP-11/60.
In the 11/60, however, there are 4 other bits used to ensure that the
data in the cache is valid. 3 of these are parity bits; one for each byte
and one for the tag. The parity bits are used to check that a single
bit error has not occurred to the data while in the cache. A fourth
bit, called the valid bit is used to indicate whether or not a given
location in cache is valid.
In the PDP-11/60 and in many other processors, the cache is not
updated if memory is altered by a device other than the CPU (for
example when a disk stores new data in memory). When such a
memory operation occurs to a location which has its value stored
in cache, the valid bit is reset to show that the data is stale and
does not correspond to the data in main memory. As well, the valid
bit is reset when power is first applied to the processor or when the
processor recovers from a power failure, because the data found in
the cache at that time will be invalid.
In the PDP-11/60, the data path from memory to cache was the
same size (16 bits) as from cache to the CPU. (In the PDP-11/70,
a faster machine, the data path from the CPU to cache was 16 bits,
while from memory to cache was 32 bits which means that the cache
had effectively prefetched the next instruction, approximately half
of the time). The number of consecutive words taken from main
memory into the cache on each memory fetch is called the line size
of the cache. A large line size allows the prefetching of a number
of instructions or data words. All items in a line of the cache are
replaced in the cache simultaneously, however, resulting in a larger
block of data being replaced for each cache miss.
[Figure: a 1024-line cache with a line size of two words: memory address bits 17-12 form the tag, bits 11-2 the index, bit 1 selects the word in the line, and bit 0 the byte in the word.]
For a similar 2K word (or 8K byte) cache, the MIPS processor would
typically have a cache configuration as follows:
[Figure: the corresponding MIPS organization: a 32-bit memory address with bits 31-13 as the tag, bits 12-3 as the index into 1024 two-word lines, bit 2 selecting the word in the line, and bits 1-0 the byte in the word.]
Generally, the MIPS cache would be larger (64 Kbytes would be typical), with line sizes of 1, 2 or 4 words.
[Figure: a 2-way set associative cache: each of the 1024 sets holds two tag and line fields (TAG 0/LINE 0 and TAG 1/LINE 1).]
In a 2-way set associative cache, if one data line is empty for a read
operation corresponding to a particular index, then it is filled. If both
data lines are filled, then one must be overwritten by the new data.
Similarly, in an n-way set associative cache, if all n data and tag fields
in a set are filled, then one value in the set must be overwritten, or
replaced, in the cache by the new tag and data values. Note that an
entire line must be replaced each time.
Least recently used (LRU): here the value which was actually used least recently is replaced. In general, it is more likely that the most recently used value will be the one required in the near future. For a 2-way set associative cache, this strategy can be implemented by adding a single USED bit to each cache location: when a value is accessed, the USED bit of the other word in the set is set, and the bit of the accessed word is reset. The value to be replaced is then the value with its USED bit set. For an n-way set associative cache, this strategy can be implemented by storing a modulo n counter with each data word. (It is an interesting exercise to determine exactly what must be done in this case. The required circuitry may become somewhat complex, for large n.)
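For the 2-way case, the USED-bit scheme amounts to very little logic; a sketch in C (names invented):

/* 2-way set associative LRU with one USED bit per way:
   on an access to way w, set the USED bit of the other way and
   clear the bit of way w; the victim is the way whose bit is set. */
typedef struct { int used[2]; } SetLru;

static void touch(SetLru *s, int way)
{
    s->used[way]     = 0;       /* just used: not the replacement candidate */
    s->used[1 - way] = 1;       /* the other way becomes the candidate */
}

static int victim(const SetLru *s)
{
    return s->used[0] ? 0 : 1;  /* replace the way with its USED bit set */
}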
[Figures: cache miss ratio (0.001 to 0.1, log scale) plotted against line size (16 to 256), degree of associativity (1-way to fully associative), and cache size (1K to 1024K).]
Example:
Assume a cache miss rate of 5% (a hit rate of 95%), with cache memory of 1 ns cycle time, and main memory of 35 ns cycle time. We can calculate the average cycle time as

(1 − 0.05) × 1 ns + 0.05 × 35 ns = 2.7 ns
The following table shows the effective memory cycle time as a function of cache hit rate for the system in the above example:

Cache hit %   Effective cycle time (ns)
80            7.8
85            6.1
90            4.4
95            2.7
98            1.68
99            1.34
100           1.0
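The same calculation, as a trivial C function:

/* Effective memory cycle time, given the hit rate and the cycle
   times of the cache and of main memory:
       t_eff = h * t_cache + (1 - h) * t_main
   With t_cache = 1 ns and t_main = 35 ns, h = 0.95 gives 2.7 ns. */
static double effective_cycle_ns(double hit_rate,
                                 double t_cache_ns, double t_main_ns)
{
    return hit_rate * t_cache_ns + (1.0 - hit_rate) * t_main_ns;
}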
Both the VAX 3500 and the MIPS R2000 processors have interesting cache structures, and were marketed at the same time. (Interestingly, neither of the parent companies which produced these processors is now an independent company. Digital Equipment Corporation was acquired by Compaq, which in turn was acquired by Hewlett Packard. MIPS was acquired by Silicon Graphics Corporation.)
The VAX 3500 has two levels of cache memory: a 1 Kbyte 2-way set associative cache is built into the processor chip itself, and there is an external 64 Kbyte direct mapped cache. The overall cache hit rate is typically 95 to 99%. If there is an on-chip (first level) cache hit, the external memory bus is not used by the processor. The first level cache responds to a read in one machine cycle (90 ns), while the second level cache responds within two cycles. Both caches can be configured as caches for instructions only, for data only, or for both instructions and data. In a single processor system, a mixed cache is typical; in systems with several processors and shared memory, one way of ensuring data consistency is to cache only instructions (which are not modified); then all data must come from main memory, and consequently whenever a processor reads a data word, it gets the current value.
See C. J. DeVane, Design of the MicroVAX 3500/3600 Second Level Cache, Digital Technical Journal, No. 7, pp. 87-94, for a discussion of the performance of this cache.
The MIPS R2000 has no on-chip cache, but it has provision for the addition of up to 64 Kbytes of instruction cache and 64 Kbytes of data cache. Both caches are direct mapped. Separation of the instruction and data caches is becoming more common in processor systems, especially for direct mapped caches. In general, instructions tend to be clustered in memory, and data also tend to be clustered, so having separate caches reduces cache conflicts. This is particularly important for direct mapped caches. Also, instruction caches do not need any provision for writing information back into memory.
Both processors employ a write-through policy for memory writes, and both provide some buffering between the cache and memory, so processing can continue during memory writes. The VAX 3500 provides a quadword buffer, while the buffer for the MIPS R2000 depends on the particular system in which it is used. A small write buffer is normally adequate, however, since writes are much less frequent than reads.
The following two results (see High Performance Computer Architecture by H.S. Stone, Addison Wesley, Chapter 2, Section 2.2.2, pp. 57-70), derived by Puzak in his Ph.D. thesis (T.R. Puzak, Cache Memory Design, University of Massachusetts, 1985), can be used to reduce the size of the traces and still result in realistic simulations.
The first trace reduction, or trace stripping, technique assumes that a series of caches of related sizes starting with a cache of size N, all with the same line size, are to be simulated with some cache trace. The cache trace is reduced by retaining only those memory references which result in a cache miss for a direct mapped cache.
Note that, for a miss rate of 10%, 90% of the memory trace would be discarded. Lower miss rates result in higher reductions.
The reduced trace will produce the same number of cache misses as the original trace for:
A K-way set associative cache with N sets and line size L
A one-way set associative cache with 2N sets and line size L (provided that N is a power of 2)
In other words, for caches with size some power of 2, it is possible to investigate caches with sizes a multiple of the initial cache size, and with arbitrary set associativity, using the same reduced trace.
Interleaved memory
In large computer systems, it is common to have several sets of data and address lines connected to independent banks of memory, arranged so that adjacent memory words reside in different memory banks. Such memory systems are called interleaved memories, and allow simultaneous, or time overlapped, access to adjacent memory locations. Memory may be n-way interleaved, where n is usually a power of two; 2-, 4- and 8-way interleaving is common in large mainframes. In such systems, the cache size typically would be sufficient to contain a data word from each bank. The following diagram shows an example of a 4-way interleaved memory.
[Figure: a 4-way interleaved memory in two arrangements: (a) all banks on a shared memory bus, and (b) banks with separate connections to the CPU.]
Here we can have two cases: case (a), where the execution time is less than the full time for an operand fetch, and case (b), where the execution time is greater than the time for an operand fetch. The following figures (a) and (b) show cases (a) and (b) respectively.
[Figures: timing of overlapped instruction and operand fetches in an interleaved memory (access time ta, cycle time ts, data transfer time td), for (a) execution time tea less than the operand fetch time and (b) execution time teb greater than the operand fetch time.]
Note that this example assumes that there is no conflict: the instruction and its operand are in separate memory banks. For this example, the instruction execution time is

ti = 2ta + td + te

If ta ≫ ts and te is small, then ti(interleaved) ≈ (1/2) ti(non-interleaved).
IF = Σ_{K=1}^{N} (1 − λ)^(K−1) = (1/λ) [1 − (1 − λ)^N]

where N is the number of interleaved memory banks. IF is, effectively, the number of memory banks being used.
Example:
If N = 4 and λ = 0.1, then

IF = (1/0.1) (1 − (1 − 0.1)^4) = 10 (1 − 0.9^4) ≈ 3.4

For operands, a simple (but rather pessimistic) thing is to assume that the data is randomly distributed in the memory banks. In this case, the probability Q(K) of a string of length K is

Q(K) = (K/N) ((N−1)/N) ((N−2)/N) ... = (N−1)! K / ((N−K)! N^K)

and the average number of operand fetches is

OF = Σ_{K=1}^{N} K Q(K)
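The interleave factor is easy to evaluate numerically; a small C sketch (with lambda spelled out):

#include <math.h>

/* Interleave factor: IF = (1/lambda) * (1 - (1 - lambda)^N). */
static double interleave_factor(int n_banks, double lambda)
{
    return (1.0 / lambda) * (1.0 - pow(1.0 - lambda, n_banks));
}
/* interleave_factor(4, 0.1) == 10 * (1 - 0.9^4), about 3.4 */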
[Figure: the shell pipeline cat file1 file2 | sort: the output of cat is connected to the input of sort.]
Here there are two processes, cat and sort, with their data specified.
When this command is executed, the processes cat and sort are
particular instances of the programs cat and sort.
Note that these two processes can exist in at least 3 states: active,
or running; ready to run, but temporarily stopped because the other
process is running; or blocked waiting for data from another process.
[Figure: the three process states (active, ready, blocked) and the transitions between them.]
Following is a simplified process state diagram for the UNIX operating system:
[Figure: simplified UNIX process state diagram: birth (start process), ready, kernel running, user running, and asleep, with transitions for system calls or interrupts, interrupt return, schedule process, sleep, wakeup, and death (kill).]
One of the most fundamental resources to be allocated among processes (in a single CPU system) is the main memory.
A number of allocation strategies are possible:
(1) single storage allocation: here all of main memory (except for space for the operating system nucleus, or kernel) is given to the current process.
Two problems can arise here:
1. the process may require less memory than is available (wasteful
of memory)
2. the process may require more memory than is available.
The second is a serious problem, which can be addressed in several
ways. The simplest of these is by static overlay, where a block of
data or code not currently required is overwritten in memory by the
required code or data.
This was originally done by the programmer, who embedded commands to load the appropriate blocks of code or data directly in the
source code.
Later, loaders were available which analyzed the code and data blocks
and loaded the appropriate blocks when required.
This type of memory management is still used in primitive operating
systems (e.g., DOS).
[Figure: a static overlay tree of numbered segments (from 8k to 80k), in which only the segments on one path from the root need be resident at once.]
Clearly, segments at the same level in the tree need not be memory
resident at the same time. e.g., in the above example, it would be
appropriate to have segments (1,3,9) and (5,7) in memory simultaneously, but not, say, (2,3).
[Figure: fixed-partition allocation: a 40k kernel and three jobs in fixed partitions, with wasted space where a job is smaller than its partition.]
This system did not offer a very efficient use of memory; the systems
manager had to determine an appropriate memory partition, which
was then fixed. This limited the number of processes, and the mix
of processes which could be run at any given time.
Also, in this type of system, dynamic data structures pose difficulties.
[Figure: variable-partition allocation at three successive times: jobs are loaded where they fit, and terminating jobs leave free holes of various sizes between the remaining jobs.]
Here, dynamic data structures are still a problem: jobs are placed in areas where they fit at the time of loading.
A new problem here is memory fragmentation: it is usually much easier to find a block of memory for a small job than for a large job. Eventually, memory may contain many small jobs, separated by holes too small for any of the queued processes.
This effect may seriously reduce the chances of running a large job.
[Figure: memory compaction: the remaining jobs are moved together so that the separate free holes are consolidated into one large free block.]
In this system, the whole program must be moved, which may have a penalty in execution time. This is a tradeoff: how frequently memory should be compacted, against the performance lost to memory fragmentation.
Again, dynamic memory allocation is still difficult, but less so than for the other systems.
The process of translating, or mapping, a virtual address into a physical address is called virtual address translation. The following
diagram shows the relationship between a named variable and its
physical location in the system.
[Figure: the relationship between a logical name in the name space, its virtual address, and its physical address.]
[Figure: paged address translation: the virtual page number indexes the page map, which supplies the base address of the page in physical memory; the page offset is appended unchanged.]
Note that whole page blocks in virtual memory are mapped to whole
page blocks in physical memory.
This means that the page offset is part of both the virtual and physical address.
Requiring two memory fetches for each instruction is a large performance penalty, so most virtual addressing systems have a small
associative memory (called a translation lookaside buffer, or TLB)
which contains the last few virtual addresses and their corresponding physical addresses. Then for most cases the virtual to physical
mapping does not require an additional memory access. The following diagram shows a typical virtual-to-physical address mapping in
a system containing a TLB:
[Figure: address translation with a TLB: the virtual page number is presented to the TLB; on a hit the physical page number is produced directly, and on a miss the page map in memory supplies it.]
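The translation path can be sketched in C. This is a simplified software model (the associative search is written as a loop, though the hardware performs it in parallel, and all structures are invented for illustration):

#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_BITS   12                     /* 4 KB pages */

typedef struct { uint32_t vpn, ppn; int valid; } TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];
static uint32_t page_map[1 << 20];         /* page table in main memory */

static uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)  /* done in parallel in hardware */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << PAGE_BITS) | offset;

    /* TLB miss: one extra memory access to the page map (the TLB refill,
       and a page fault for a non-resident page, are not shown). */
    return (page_map[vpn] << PAGE_BITS) | offset;
}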
[Figure: per-process virtual address spaces mapped through virtual to physical address translation onto physical memory.]
Note that not all the virtual address blocks are in the physical memory at the same time. Furthermore, adjacent blocks in virtual memory are not necessarily adjacent in physical memory.
If a block is moved out of physical memory and later replaced, it may
not be at the same physical address.
The translation process must be fast, most of the time.
The following is an example of a paged memory management configuration using a fully associative page translation table:
Consider a computer system which has 16 M bytes (2^24 bytes) of main memory, and a virtual memory space of 2^32 bytes. The following diagram sketches the page translation table required to manage all of main memory if the page size is 4K (2^12) bytes. Note that the associative memory is 20 bits wide (32 bits − 12 bits, the virtual address size minus the page size). Also, to manage 16 M bytes of memory with a page size of 4 K bytes, a total of 16M/4K = 2^12 = 4096 associative memory locations are required.
[Figure: the fully associative page translation table: virtual address bits 31-12 are matched against all 4096 associative memory entries, each paired with a physical page address; bits 11-0 select the byte in the page.]
Some other attributes are usually included in a page translation table, as well, by adding extra fields to the table. For example, pages
or segments may be characterized as read only, read-write, etc. As
well, it is common to include information about access privileges, to
help ensure that one program does not inadvertently corrupt data for
another program. It is also usual to have a bit (the dirty bit) which
indicates whether or not a page has been written to, so that the page
will be written back onto the disk if a memory write has occurred
into that page. (This is done only when the page is swapped,
because disk access times are too long to permit a write-through
policy like cache memory.) Also, since associative memory is very expensive, it is not usual to map all of main memory using associative
memory; it is more usual to have a small amount of associative memory which contains the physical addresses of recently accessed pages,
and maintain a virtual address translation table in main memory
for the remaining pages in physical memory. A virtual to physical
address translation can normally be done within one memory cycle
if the virtual address is contained in the associative memory; if the
address must be recovered from the virtual address translation table in main memory, at least one more memory cycle must be used
to retrieve the physical address from main memory.
There is a kind of trade-off between the page size for a system and the size of the page translation table (PTT). If a processor has a small page size, then the PTT must be quite large to map all of the virtual memory space. For example, if a processor has a 32 bit virtual memory address, and a page size of 512 bytes (2^9 bytes), then there are 2^23 possible page table entries. If the page size is increased to 4 Kbytes (2^12 bytes), then the PTT requires only 2^20, or 1 M, page table entries. These large page tables will normally not be very full, since the number of entries is limited to the amount of physical memory available.
One way these large, sparse PTTs are managed is by mapping the
PTT itself into virtual memory. (Of course, the pages which map
the virtual PTT must not be mapped out of the physical memory!)
There are also other pages that should not be mapped out of physical
memory. For example, pages mapping to I/O buffers. Even the I/O
devices themselves are normally mapped to some part of the physical
address space.
Note that both paged and segmented memory management provide the users of a computer system with all the advantages of a
large virtual address space. The principal advantage of the paged
memory management system over the segmented memory management system is that the memory controller required to implement a
paged memory management system is considerably simpler. Also,
the paged memory management does not suffer from fragmentation
in the same way as segmented memory management. Another kind
of fragmentation does occur, however. A whole page is swapped in or
out of memory, even if it is not full of data or instructions. Here the
fragmentation is within a page, and it does not persist in the main
memory when new pages are swapped in.
One problem found in virtual memory systems, particularly paged
memory systems, is that when there are a large number of processes
executing simultaneously as in a multiuser system, the main memory may contain only a few pages for each process, and all processes
may have only enough code and data in main memory to execute for
a very short time before a page fault occurs. This situation, often
called thrashing, severely degrades the throughput of the processor because it actually must spend time waiting for information to
be read from or written to the disk.
MIPS R2000
The MIPS R2000 has a 4 Kbyte page size, and 64 entries in its fully associative TLB, which can perform two translations in each machine cycle: one for the instruction to be fetched and one for the data to be fetched or stored (for the LOAD and STORE instructions).
Unlike the VAX 3500 (and most other processors, including other RISC processors), the MIPS R2000 does not handle TLB misses using hardware. Rather, an exception (the TLB miss exception) is generated, and the address translation is handled in software. In fact, even the replacement of the entry in the TLB is handled in software. Usually, the replacement algorithm chosen is random replacement; the processor generates a random number between 8 and 63 for this purpose. (The lowest 8 TLB locations are normally reserved for the kernel; e.g., to refer to such things as the current PTT.)
This is another example of the MIPS designers making a tradeoff: providing a larger TLB, thus reducing the frequency of TLB misses, at the expense of handling those misses in software, much as if they were page faults.
A page fault, however, would cause the current process to be stopped and another to be started, so the cost in time would be much higher than a mere TLB miss.
[Figure: page faults versus number of pages allocated (6 to 14), for the FIFO, CLOCK, LRU, and OPT replacement policies.]
[Figure: page faults (×1000) versus number of pages (8 to 1024, for a fixed 8K memory), for the FIFO, CLOCK, LRU, and OPT policies.]
Note that, when the page size is sufficiently small, the performance degrades. In this (small) example, the small number of pages loaded in memory degrades the performance severely for the largest page size (2K bytes, corresponding to only 4 pages in memory). Performance improves with an increased number of pages (of smaller size) in memory, until the page size becomes small enough that a page doesn't hold an entire logical block of code.
Example:
Given a program with 7 virtual pages {a,b,...,g} and a reference sequence of 14 accesses, the set of pages resident after each reference is:

 1  a          8  acgf
 2  ab         9  acgf
 3  ab        10  acgf
 4  abc       11  acgf
 5  abcg      12  agfd
 6  acg       13  afdb
 7  acgf      14  fdbg
[Figure: page faults versus memory allocated, comparing LRU and working set (WS) replacement.]
Page fault frequency replacement
This is another method for varying the amount of physical memory available to a process. It is based on the simple observation that, when the frequency of page faults for a process increases above some threshold, then more memory should be allocated to the process.
The page fault frequency (PFF) can be approximated by 1/(time between page faults), although a better estimate can be obtained by averaging over a few page faults.
A PFF implementation must both increase the number of pages if the PFF is higher than some threshold, and must also lower the number of pages in some way. A reasonable policy might be the following:
Increase the number of pages allocated to the process by 1 whenever the PFF is greater than some threshold Th .
Decrease the number of pages allocated to the process by 1 whenever the PFF is less than some threshold Tl .
If Tl < PFF < Th, then replace a page in memory by some other
reasonable policy; e.g., LRU.
The thresholds Th and Tl should be system parameters, depending
on the amount of physical memory available.
An alternative policy for decreasing the number of pages allocated
to a process might be to decrease the number of pages allocated to
a process when the PFF does not exceed T for some period of time.
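A sketch of the first policy in C (the threshold values are placeholders; as noted above, they should be system parameters):

/* Page-fault-frequency allocation: adjust a process's page
   allocation according to the observed fault rate. */
#define T_H 0.02    /* faults per unit time: upper threshold (placeholder) */
#define T_L 0.001   /* lower threshold (placeholder) */

static int adjust_allocation(int pages, double pff)
{
    if (pff > T_H)      return pages + 1;  /* faulting too often: grow */
    else if (pff < T_L) return pages - 1;  /* over-allocated: shrink */
    /* otherwise keep the allocation and replace within it (e.g., LRU) */
    return pages;
}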
Note that in all the preceding we have implicitly assumed that pages will be loaded on demand: this is called demand paging. It is also possible to attempt to predict what pages will be required in the future, and preload the pages in anticipation of their use. The penalty for a bad guess is high, however, since part of memory will be filled with useless information. Some systems do use preloading algorithms, but most present systems rely on demand paging.
[Figure: the x86-64 virtual address: bits 63-48 unused; four 9-bit indices (page map level 4, bits 47-39; level 3, bits 38-30; page directory, bits 29-21; page table, bits 20-12); and a 12-bit offset in bits 11-0.]
The 12 bit offset specifies the byte in a 4KB page. The 9 bit (512 entry) page table points to the specific page, while the three higher level (9 bit, 512 entry) tables are used to point eventually to the page table.
The page table itself maps 512 × 4KB pages, or 2MB of memory. Adding one more level increases this by another factor of 512, for 1GB of memory, and so on.
Clearly, most programs do not use anywhere near all the available virtual memory, so the page tables and higher level page maps are very sparse.
Both Windows 7/8 and Linux use a page size of 4KB, although Linux
also supports a 2MB page size for some applications.
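Extracting the four table indices and the offset from a virtual address is a matter of shifts and masks; a C sketch:

#include <stdint.h>

/* Split an x86-64 virtual address into its four 9-bit table indices
   and the 12-bit page offset. */
static void split_va(uint64_t va, unsigned idx[4], unsigned *offset)
{
    *offset = va & 0xFFF;                /* bits 11..0 */
    idx[0]  = (va >> 12) & 0x1FF;        /* page table index, bits 20..12 */
    idx[1]  = (va >> 21) & 0x1FF;        /* page directory, bits 29..21 */
    idx[2]  = (va >> 30) & 0x1FF;        /* level 3 map, bits 38..30 */
    idx[3]  = (va >> 39) & 0x1FF;        /* page map level 4, bits 47..39 */
}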
[Figure: the 32-bit ARM virtual address: a 10-bit outer page field (bits 31-22), a 10-bit inner page field (bits 21-12), and a 12-bit offset into a 4KB page.]
The 10 bit (1K entry) outer page table points to an inner page
table of the same size. The inner page table contains the mapping for the virtual page in physical memory.
Again, Linux on the ARM architecture uses 4KB pages, as do the
other operating systems commonly running on the ARM.
Different ARM implementations have different size TLBs, implemented in the hardware. Of course, the page table mapping is used
only on a TLB miss.
[Figure: the structure of the UNIX system: user programs enter the kernel through libraries, traps, and interrupts; the kernel comprises the file subsystem (with buffer cache and character/block device drivers) and the process control subsystem (inter-process communication, scheduler, memory management), sitting on the hardware control layer above the computer itself.]
[Figure: the full UNIX process state diagram: created (by fork), ready in memory or ready swapped, kernel running, user running, preempted, asleep in memory or asleep swapped, and zombie (after exit), with transitions for scheduling, preemption, sleep, wakeup, swap in/out, system calls, and interrupts.]
In the operating system, each process is represented by its own process control block (sometimes called a task control block, or job
control block). This process control block is a data structure (or set
of structures) which contains information about the process. This
information includes everything required to continue the process if it
is blocked for any reason (or if it is interrupted). Typical information
would include:
the process state: ready, running, blocked, etc.
the values of the program counter, stack pointer, and other internal registers
process scheduling information, including the priority of the process, its elapsed time, etc.
memory management information
I/O information: status of I/O devices, queues, etc.
accounting information; e.g., CPU and real time, amount of disk used, amount of I/O generated, etc.
In many systems, the space for these process control blocks is allocated (in system space memory) when the system is generated, which places a firm limit on the number of processes which can be allocated at one time. (The simplicity of this allocation makes it attractive, even though it may waste part of system memory by having blocks allocated which are rarely used.)
Following is a diagram of the process management data structures in
a typical UNIX system:
[Figure: process management data structures in a typical UNIX system: the process table and per-process region tables point to the u area and to the text and stack regions in main memory.]
Process scheduling:
Although in a modern multi-tasking system, each process can make
use of the full resources of the virtual machine while actually sharing these resources with other processes, the perceived use of these
resources may depend considerably on the way in which the various
processes are given access to the processor. We will now look at
some of the things which may be important when processes are to
be scheduled.
We can think of the scheduler as the algorithm which determines which virtual machine is currently mapped onto the physical machine.
Actually, two types of scheduling are required: a long term scheduler, which determines which processes are to be loaded into memory, and a short term scheduler, which determines which of the processes in memory will actually be running at any given time. The short-term scheduler is also called the dispatcher.
Most scheduling algorithms deal with one or more queues of processes; each process in each queue is assigned a priority in some way, and the process with the highest priority is the process chosen to run next.
Replacement strategies
There are a small number of commonly used replacement strategies
for a block:
random replacement
first-in first-out (FIFO)
first-in not used first-out (clock)
Least recently used (LRU)
Speculation (prepaging, preloading)
Writing
There are two basic strategies for writing data from one level of the
hierarchy to the other:
Write through: both levels are consistent, or coherent.
Write back: only the highest level has the correct value, and it is written back to the next level on replacement. This implies that there is a way of indicating that the block has been written into (e.g., with a used bit).
We will use the ATMEL AVR series of processors as example input/output processors, or controllers for I/O devices.
These 8-bit processors, and others like them (PIC microcontrollers,
8051s, etc.) are perhaps the most common processors in use today.
Frequently, they are not used as individually packaged processors,
but as part of embedded systems, particularly as controllers for other
components in a larger integrated system (e.g., mobile phones).
There are also 16-, 32- and 64-bit processors in the embedded systems
market; the MIPS processor family is commonly used in the 32-bit
market, as is the ARM processor. (The ARM processor is universal
in the mobile telephone market.)
We will look at the internal architecture of the ATMEL AVR series
of 8-bit microprocessors.
They are available as single chip devices in package sizes from 8 pins
(external connections) to 100 pins, and with program memory from
1 to 256 Kbytes.
AVR architecture
Internally, the AVR microcontrollers have:
32 8-bit registers, r0 to r31
16 bit instruction word
a minimum 16-bit program counter (PC)
separate instruction and data memory
64 registers dedicated to I/O and control
externally interruptible, interrupt source is programmable
most instructions execute in one cycle
The top 6 registers can be paired as address pointers to data memory.
The X-register is the pair (r26, r27),
the Y-register is the pair (r28, r29), and
the Z-register is the pair (r30, r31).
The Z-register can also point to program memory.
Generally, only the top 16 registers (r16 to r31) can be targets for
the immediate instructions.
166
The program memory is flash programmable, and is fixed until overwritten with a programmer. It is guaranteed to survive at least 10,000
rewrites.
In many processors, self programming is possible. A bootloader can
be stored in protected memory, and a new program downloaded by
a simple serial interface.
In some older small devices, there is no data memory: programs use registers only. In those processors, there is a small stack in hardware (3 entries). In the processors with data memory, the stack is located in the data memory.
The size of on-chip data memory (SRAM, static memory) varies from 0 to 16 Kbytes.
Most processors also have EEPROM memory, from 0 to 4 Kbytes.
The C compiler can only be used for devices with SRAM data memory.
Only a few of the older tiny ATMEL devices do not have SRAM
memory.
[Figure: the AVR core: the program counter and program flash; the instruction register, instruction decoder, and control lines; the general purpose registers (including the X pointer pair), the ALU, and the status register; and the stack pointer with the SRAM or hardware stack.]
Note that the datapath is 8 bits, and the ALU accepts two independent operands from the register file.
Note also the status register, (SREG) which holds information about
the state of the processor; e.g., if the result of a comparison was 0.
168
On-chip peripherals
[Figure: the same AVR core surrounded by its on-chip peripherals:
internal oscillator, timing and control, watchdog timer, MCU control
register, interrupt unit, data EEPROM, ADC, analog comparator,
timers, and, for each port X, a data direction register, a data register,
and the port drivers for pins Px0 to Px7]
169
[Figure: AVR memory maps: the program flash holds an application
program section and a boot loader section (up to 0x1FFF); in the data
memory, the 32 general purpose registers occupy 0x0000 to 0x001F,
the 64 I/O registers follow up to 0x005F, and the internal SRAM runs
up to 0x04FF]
Note that the general purpose registers and I/O registers are mapped
into the data memory.
Although most (all but 2) AVR processors have EEPROM memory,
this memory is accessed through I/O registers.
170
Many immediate instructions can only use registers r16 to r31, so be
careful with those instructions.
There is no add immediate instruction.
There is an add word immediate instruction (adiw) which operates on
the pointer register pairs as 16-bit entities, in 2 cycles. The maximum
constant which can be added is 63.
There are also logical and logical immediate instructions.
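For example (a sketch; the usual idiom for the missing add immediate
is subi with a negated constant):

subi r16, -10      ; subtract -10: adds 10 (immediate ops use r16 to r31 only)
adiw r26, 63       ; add word immediate to the X pair (r27:r26); max constant 63
andi r17, 0x0f     ; logical AND immediate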
171
These instructions can also post-increment or pre-decrement the index register. E.g.,
ld r19, X+ ; load register 19 with the value pointed
; to by the index register X (r26, r27)
; and add 1 to the register pair
ld r19, -Y ; subtract 1 from the index reg. Y (r28, r29)
; then load register 19 with the value pointed
; to by the decremented value of Y
There is also a load immediate (ldi) which can only operate on
registers r16 to r31.
ldi r17, 14
There are also push and pop instructions which push a byte onto,
or pop a byte off, the stack. (The stack pointer is in the I/O space,
registers 0x3D and 0x3E.)
172
The relative call (rcall) instruction is similar, but places the return
address (PC + 1) on the stack.
The return instruction (ret) returns from a function call by replacing
the PC with the value on the stack.
There are also instructions which skip over the next instruction on
some condition. For example, the instruction skip if register bit
set (sbrs) skips the next instruction (increments the PC by 2 or 3) if a
particular bit in the designated register is set.
sbrs r1, 4    ; skip the next instruction if bit 4 of r1 is set
173
#include <m168def.inc>
.org 0
; define interrupt vectors
vects:
    rjmp reset
reset:
    ldi R16, 0b00100000
    out DDRB, R16       ; PB5 as output
    ser R16             ; R16 = 0xff
    out PORTB, R16      ; PB5 high, pull-ups enabled on the inputs
177
LOOP:
    sbic PINB, 4        ; skip the next instruction if PB4 reads 0 (switch pressed)
    rjmp LOOP           ; not pressed: repeat the test
    cbi PORTB, 5        ; drive PB5 low
SPIN1:
    subi R16, 1         ; crude delay: count R16 down to 0
    brne SPIN1
    sbi PORTB, 5        ; drive PB5 high
SPIN2:
    subi R16, 1         ; R16 wraps around, giving 256 more iterations
    brne SPIN2
    rjmp LOOP
178
#include <avr/io.h>
#include <util/delay_basic.h>

int main(void)
{
    DDRB = 0B00100000;                    /* PB5 as output */
    PORTB = 0B11111111;                   /* PB5 high, pull-ups on the inputs */
    while (1) {
        while (!(PINB & 0B00010000)) {    /* while the switch (PB4) reads low */
            PORTB = 0B00100000;           /* PB5 high */
            _delay_loop_1(128);
            PORTB = 0;                    /* PB5 low */
            _delay_loop_1(128);
        }
    }
    return(1);
}
Two words about mechanical switches: they bounce! That is, they
make and break contact several times in the few milliseconds before
full contact is made or broken. This means that a single switch
operation may be seen as several switch actions.
The way this is normally handled is to read the value at the switch (in
a loop) several times over a short period, and report a stable value.
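A minimal debouncing sketch in C (assuming, as in the earlier example,
an active-low switch on PINB bit 4, and avr-libc):

#include <avr/io.h>
#include <util/delay_basic.h>

static uint8_t switch_pressed(void)
{
    uint8_t count = 0;
    for (uint8_t i = 0; i < 8; i++) {   /* several readings over a short period */
        if (!(PINB & 0B00010000))       /* pressed reads as 0 (active low) */
            count++;
        _delay_loop_1(255);             /* short settling delay between reads */
    }
    return count == 8;                  /* report only a stable "pressed" */
}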
179
Address   Event
0x0000    RESET     (power on or reset)
0x0002    INT0
0x0004    INT1
0x0006    PCINT0
0x0008    PCINT1
0x000A    PCINT2
0x000C    WDT       (watchdog timer)
181
To enable the interrupt, the appropriate bit of the external interrupt
mask register (EIMSK) is set:
sbi EIMSK, 7
There is also a register associated with the particular pins for the
PCINT interrupts. They are the Pin Change Mask Registers (PCMSK1
and PCMSK0).
We want to enable the input connected to the switch, at PINB[4],
which is pcint12, and therefore set bit 4 of PCMSK1, leaving the
other bits unchanged.
Normally, this would be possible with the code
sbi PCMSK1, 4
Unfortunately, this register is one of the extended I/O registers, and
must be written to as a memory location.
The following code sets up its address in register pair Y, reads the
current value in PCMSK1, sets bit 4 to 1, and rewrites the value in
memory.
ldi r28, PCMSK1      ; low byte of the address of PCMSK1 into YL
clr r29              ; high byte (YH) = 0
ld  r16, Y           ; read the current value of PCMSK1
sbr r16, 0b00010000  ; set bit 4
st  Y, r16           ; write it back
Now, the appropriate interrupt vectors must be set, as in the table shown earlier, and interrupts enabled globally by setting the
interrupt (I) flag in the status register (SREG).
This latter operation is performed with the instruction sei.
The interrupt vector table should look as follows:
.org 0
vects:
    jmp RESET
    jmp EXT_INT0
    jmp EXT_INT1
    jmp PCINT0
    jmp PCINT1
    jmp PCINT2
The next thing necessary is to set the stack pointer to a high memory
address, since interrupts push values on the stack:
ldi r16, 0xff
#include <m168def.inc>
.org 0
VECTS:
    jmp RESET       ; vector for reset
    jmp EXT_INT0    ; vector for int0
    jmp EXT_INT1    ; vector for int1
    jmp PCINT_0     ; vector for pcint0
    jmp BUTTON      ; vector for pcint1
    jmp PCINT_2     ; vector for pcint2
    jmp WDT         ; vector for watchdog timer

EXT_INT0:
EXT_INT1:
PCINT_0:
PCINT_2:
WDT:
    reti            ; unused interrupt sources simply return

RESET:
    ldi r16, 0xff
    out SPL, r16    ; stack pointer low byte
    ldi r16, 0x04
    out SPH, r16    ; stack pointer high byte (stack at top of SRAM, 0x04ff)
    sbi EIMSK, 7    ; enable the interrupt source
    sbi EIFR, 7     ; clear its pending flag (writing a 1 clears it)
185
    sei             ; enable interrupts globally
    rjmp LOOP       ; enter the main loop

BUTTON:             ; pcint1 (button) handler: nothing to do here;
    reti            ; the interrupt itself is enough to wake the processor

SNOOZE:
    sleep           ; sleep until an interrupt occurs
LOOP:
    sbic PINB, 4    ; skip the next instruction if PB4 reads 0 (switch pressed)
    rjmp SNOOZE     ; not pressed: go back to sleep
    cbi PORTB, 5    ; drive PB5 low
    ldi R16, 128
SPIN1:
    subi R16, 1     ; crude delay loop
    brne SPIN1
    sbi PORTB, 5    ; drive PB5 high
    ldi R16, 128
SPIN2:
    subi R16, 1
    brne SPIN2
    rjmp LOOP
186
Input-Output Architecture
In our discussion of the memory hierarchy, it was implicitly assumed
that memory in the computer system would be fast enough to
match the speed of the processor (at least for the highest levels
in the memory hierarchy), and that no special consideration need be
given to how long it would take for a word to be transferred from
memory to the processor: an address would be generated by the
processor, and after some fixed time interval, the memory system
would provide the required information. (In the case of a cache miss,
the time interval would be longer, but generally still fixed. For a
page fault, the processor would be interrupted and the page fault
handling software invoked.)
Although input-output devices are mapped to appear like memory
devices in many computer systems, I/O devices have characteristics
quite different from memory devices, and often pose special problems
for computer systems. This is principally for two reasons:
I/O devices span a wide range of speeds (e.g., terminals accepting input at a few characters per second; disks reading data at
over 10 million characters per second).
Unlike memory operations, I/O operations and the CPU are not
generally synchronized with each other.
187
I/O devices also have other characteristics; for example, the amount
of data required for a particular operation. For example, a keyboard
inputs a single character at a time, while a color display may use
several Mbytes of data at a time.
The following table lists several I/O devices and some of their typical properties:

Device           Data per operation   Data rate (Kbytes/s)   Partner
keyboard         0.001 Kbytes         0.01                   human/machine
mouse            0.001 Kbytes         0.1                    human/machine
voice input      --                   --                     human/machine
laser printer    1 to 1000+ Kbytes    1000                   machine/human
color display    --                   100,000+               machine/human
magnetic disk    4 to 4000 Kbytes     100,000+               system
CD/DVD           --                   1000                   system
LAN              --                   100,000+               system/system
188
The following figure shows the general I/O structure associated with
many medium-scale processors. Note that the I/O controllers and
main memory are connected to the main system bus. The cache
memory (usually found on-chip with the CPU) has a direct connection to the processor, as well as to the system bus.
[Figure: the CPU and its cache connect to each other and to the
system bus; main memory and the I/O controllers also sit on the
system bus; the I/O devices attach to the I/O controllers, and
interrupt and control lines run between the I/O controllers and
the CPU]
Note that the I/O devices shown here are not connected directly
to the system bus; they interface with another device called an I/O
controller.
189
In simpler systems, the CPU may also serve as the I/O controller,
but in systems where throughput and performance are important,
I/O operations are generally handled outside the processor.
In higher performance processors (desktop and workstation systems)
there may be several separate I/O buses. The PC today has separate
buses for memory (the FSB, or front-side bus), for graphics (the AGP
bus or PCIe x16 bus), and for I/O devices (the PCI or PCIe bus).
It has one or more high-speed serial ports (USB or Firewire), and
100 Mbit/s or 1 Gbit/s network ports as well. (The PCIe bus is also
serial.)
It may also support several legacy I/O systems, including serial
(RS-232) and parallel (printer) ports.
Until relatively recently, the I/O performance of a system was somewhat of an afterthought for systems designers. The reduced cost of
high-performance disks, permitting the proliferation of virtual memory systems, and the dramatic reduction in the cost of high-quality
video display devices, have meant that designers must pay much
more attention to this aspect to ensure adequate performance in the
overall system.
Because of the different speeds and data requirements of I/O devices,
different I/O strategies may be useful, depending on the type of I/O
device which is connected to the computer. We will look at several
different I/O strategies later.
190
Note that each of the talker and listener supplies two signals. The
talker supplies a signal (say, data valid, or DAV) at step (1), and
another signal (data not valid) at step (3). Both of these can be
coded as a single binary value (DAV) which takes the value 1 at
step (1) and 0 at step (3). The listener supplies a signal (say, data
accepted, or DAC) at step (2), and a signal (data not accepted) at
step (4). It, too, can be coded as a single binary variable, DAC.
Because only two binary variables are required, the handshaking
information can be communicated over two wires, and the form of
handshaking described above is called a two-wire handshake.
The following figure shows a timing diagram for the signals DAV
and DAC which illustrates the timing of these four events:
[Timing diagram: DAV rises to 1 at (1) and falls to 0 at (3);
DAC rises to 1 at (2) and falls to 0 at (4)]
192
As stated earlier, either the CPU or the I/O device can act as the
talker or the listener. In fact, the CPU may act as a talker at one
time and a listener at another. For example, when communicating
with a terminal screen (an output device) the CPU acts as a talker,
but when communicating with a terminal keyboard (an input device)
the CPU acts as a listener.
This is about the simplest synchronization which can guarantee reliable communication between two devices. It may be inadequate
where there are more than two devices.
Other forms of handshaking are used in more complex situations; for
example, where there may be more than one controller on the bus,
or where the communication is among several devices.
For example, there is also a similar, but more complex, 3-wire handshake which is useful for communicating among more than two devices.
193
194
Program-controlled I/O
One common I/O strategy is program-controlled I/O (often called
polled I/O). Here, all I/O is performed under the control of an I/O handling procedure, and input or output is initiated by this procedure.
The I/O handling procedure will require some status information
(handshaking information) from the I/O device (e.g., whether the
device is ready to receive data). This information is usually obtained
through a second input from the device; a single bit is usually sufficient, so one input port can be used to collect status, or handshake,
information from several I/O devices. (A port is the name given to a
connection to an I/O device; e.g., to the memory location into which
an I/O device is mapped). An I/O port is usually implemented as
a register (possibly a set of D flip flops) which also acts as a buffer
between the CPU and the actual I/O device. The word port is often
used to refer to the buffer itself.
Typically, there will be several I/O devices connected to the processor; the processor checks the status input port periodically, under
program control by the I/O handling procedure. If an I/O device
requires service, it will signal this need by altering its input to the
status port. When the I/O control program detects that this has
occurred (by reading the status port) then the appropriate operation
will be performed on the I/O device which requested the service.
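As a rough sketch in C, polling might look as follows (the port
addresses and the service() routine are invented for illustration;
a memory-mapped status port is assumed):

#include <stdint.h>

extern void service(int device, uint8_t value);      /* hypothetical handler */

#define STATUS_PORT  (*(volatile uint8_t *)0x0040)   /* assumed addresses */
#define DATA_PORT(i) (*(volatile uint8_t *)(0x0041 + (i)))

void poll_loop(void)
{
    for (;;) {
        uint8_t status = STATUS_PORT;        /* one handshake bit per device */
        for (int i = 0; i < 8; i++)
            if (status & (1u << i))          /* device i requests service */
                service(i, DATA_PORT(i));    /* read its port and handle it */
        /* other work can be done here before the next poll */
    }
}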
195
[Figure: program-controlled I/O: each of devices 1 to N connects to
its own input port, and the devices' handshake (status) lines are
collected into a common status port which the processor polls]
197
#include <m168def.inc>
.org 0
; define interrupt vectors
vects:
    rjmp reset
reset:
    ldi R16, 0b00100000
    out DDRB, R16       ; PB5 as output
    ser R16             ; R16 = 0xff
    out PORTB, R16      ; PB5 high, pull-ups enabled on the inputs
200
LOOP:
    sbic PINB, 4        ; skip the next instruction if PB4 reads 0 (switch pressed)
    rjmp LOOP           ; not pressed: repeat the test
    cbi PORTB, 5        ; drive PB5 low
SPIN1:
    subi R16, 1         ; crude delay: count R16 down to 0
    brne SPIN1
    sbi PORTB, 5        ; drive PB5 high
SPIN2:
    subi R16, 1         ; R16 wraps around, giving 256 more iterations
    brne SPIN2
    rjmp LOOP
201
#include <avr/io.h>
#include <util/delay_basic.h>

int main(void)
{
    DDRB = 0B00100000;                    /* PB5 as output */
    PORTB = 0B11111111;                   /* PB5 high, pull-ups on the inputs */
    while (1) {
        while (!(PINB & 0B00010000)) {    /* while the switch (PB4) reads low */
            PORTB = 0B00100000;           /* PB5 high */
            _delay_loop_1(128);
            PORTB = 0;                    /* PB5 low */
            _delay_loop_1(128);
        }
    }
    return(1);
}
Two words about mechanical switches: they bounce! That is, they
make and break contact several times in the few milliseconds before
full contact is made or broken. This means that a single switch
operation may be seen as several switch actions.
The way this is normally handled is to read the value at the switch (in
a loop) several times over a short period, and report a stable value.
202
Interrupt-controlled I/O
Interrupt-controlled I/O reduces the severity of the two problems
mentioned for program-controlled I/O by allowing the I/O device
itself to initiate the device service routine in the processor. This is
accomplished by having the I/O device generate an interrupt signal
which is tested directly by the hardware of the CPU. When the interrupt input to the CPU is found to be active, the CPU itself initiates
a subprogram call to somewhere in the memory of the processor; the
particular address to which the processor branches on an interrupt
depends on the interrupt facilities available in the processor.
The simplest type of interrupt facility is where the processor executes
a subprogram branch to some specific address whenever an interrupt
input is detected by the CPU. The return address (the location of
the next instruction in the program that was interrupted) is saved
by the processor as part of the interrupt process.
If there are several devices which are capable of interrupting the processor, then with this simple interrupt scheme the interrupt handling
routine must examine each device to determine which one caused
the interrupt. Also, since only one interrupt can be handled at a
time, there is usually a hardware priority encoder which allows the
device with the highest priority to interrupt the processor, if several
devices attempt to interrupt the processor simultaneously.
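With this simple scheme, the single handler has to find the requester
itself; a sketch (reusing the assumed status and data ports from the
polling example earlier):

/* entered (via the single interrupt vector) whenever the interrupt
   input is active; scans the status port for the requesting device */
void interrupt_handler(void)
{
    uint8_t status = STATUS_PORT;
    for (int i = 0; i < 8; i++) {        /* lower i = higher priority here */
        if (status & (1u << i)) {
            service(i, DATA_PORT(i));
            break;                       /* one device serviced per interrupt */
        }
    }
}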
203
In the previous figure, the handshake out outputs would be connected to a priority encoder to implement this type of I/O. The other
connections remain the same. (Some systems use a daisy chain
priority system to determine which of the interrupting devices is serviced first. Daisy chain priority resolution is discussed later.)
[Figure: interrupt-controlled I/O: devices 1 to N connect to their
ports as before, but the handshake lines are routed through a
priority encoder to the processor's interrupt input]
205
Vectored interrupts
Many computers make use of vectored interrupts. With vectored
interrupts, it is the responsibility of the interrupting device to provide the address in main memory of the interrupt servicing routine
for that device. This means, of course, that the I/O device itself
must have sufficient intelligence to provide this address when requested by the CPU, and also to be initially programmed with
this address information by the processor. Although somewhat more
complex than the simple interrupt system described earlier, vectored
interrupts provide such a significant advantage in interrupt handling
speed and ease of implementation (i.e., a separate routine for each
device) that this method is almost universally used on modern computer systems.
Some processors have a number of special inputs for vectored interrupts (each acting much like the simple interrupt described earlier).
Others require that the interrupting device itself provide the interrupt address as part of the process of interrupting the processor.
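The effect of vectoring can be sketched in C as a table of handler
addresses, indexed by whatever the device supplies (the names are
invented for illustration):

typedef void (*isr_t)(void);

static isr_t vector_table[16];            /* one entry per interrupt source */

void register_isr(unsigned vec, isr_t handler)   /* done at initialization */
{
    vector_table[vec] = handler;
}

void dispatch(unsigned vec)   /* vec is supplied by the interrupting device */
{
    if (vector_table[vec])
        vector_table[vec]();  /* branch directly to that device's routine */
}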
206
Address   Event
0x000     RESET     (power on or reset)
0x002     INT0
0x004     INT1
0x006     PCINT0
0x008     PCINT1
0x00A     PCINT2
0x00C     WDT       (watchdog timer)
207
To enable the interrupt, the appropriate bit of the external interrupt
mask register (EIMSK) is set:
sbi EIMSK, 7
There is also a register associated with the particular pins for the
PCINT interrupts. They are the Pin Change Mask Registers (PCMSK1
and PCMSK0).
We want to enable the input connected to the switch, at PINB[4],
which is pcint12, and therefore set bit 4 of PCMSK1, leaving the
other bits unchanged.
Normally, this would be possible with the code
sbi PCMSK1, 4
Unfortunately, this register is one of the extended I/O registers, and
must be written to as a memory location.
The following code sets up its address in register pair Y, reads the
current value in PCMSK1, sets bit 4 to 1, and rewrites the value in
memory.
ldi r28, PCMSK1      ; low byte of the address of PCMSK1 into YL
clr r29              ; high byte (YH) = 0
ld  r16, Y           ; read the current value of PCMSK1
sbr r16, 0b00010000  ; set bit 4
st  Y, r16           ; write it back
Now, the appropriate interrupt vectors must be set, as in the table shown earlier, and interrupts enabled globally by setting the
interrupt (I) flag in the status register (SREG).
This latter operation is performed with the instruction sei.
The interrupt vector table should look as follows:
.org 0
vects:
    jmp RESET
    jmp EXT_INT0
    jmp PCINT0
    jmp PCINT1
    jmp TIM2_COMP
The next thing necessary is to set the stack pointer to a high memory
address, since interrupts push values on the stack:
ldi r16, 0xff
211
#include <m168def.inc>
.org 0
VECTS:
    jmp RESET       ; vector for reset
    jmp EXT_INT0    ; vector for int0
    jmp EXT_INT1    ; vector for int1
    jmp PCINT_0     ; vector for pcint0
    jmp BUTTON      ; vector for pcint1
    jmp PCINT_2     ; vector for pcint2

EXT_INT0:
EXT_INT1:
PCINT_0:
PCINT_2:
    reti            ; unused interrupt sources simply return

RESET:
    ldi r16, 0xff
    out SPL, r16    ; stack pointer low byte
    ldi r16, 0x04
    out SPH, r16    ; stack pointer high byte (stack at top of SRAM, 0x04ff)
    ldi R16, 0b00100000
    out DDRB, r16   ; PB5 as output
    ser R16
    out PORTB, r16  ; PB5 high, pull-ups enabled on the inputs
    sbi EIMSK, 7    ; enable the interrupt source
    sbi EIFR, 7     ; clear its pending flag (writing a 1 clears it)
212
    sei             ; enable interrupts globally
    rjmp LOOP       ; enter the main loop

BUTTON:             ; pcint1 (button) handler: nothing to do here;
    reti            ; the interrupt itself is enough to wake the processor

SNOOZE:
    sleep           ; sleep until an interrupt occurs
LOOP:
    sbic PINB, 4    ; skip the next instruction if PB4 reads 0 (switch pressed)
    rjmp SNOOZE     ; not pressed: go back to sleep
    cbi PORTB, 5    ; drive PB5 low
    ldi R16, 128
SPIN1:
    subi R16, 1     ; crude delay loop
    brne SPIN1
    sbi PORTB, 5    ; drive PB5 high
    ldi R16, 128
SPIN2:
    subi R16, 1
    brne SPIN2
    rjmp LOOP
213
Direct memory access (DMA)
A third strategy is direct memory access, where a DMA controller
transfers blocks of data between an I/O device and main memory
directly, without the processor handling each word.
There are two possibilities for the timing of the data transfer from
the DMA controller to memory:
The controller can cause the processor to halt if it attempts to
access data in the same bank of memory into which the controller
is writing. This is the fastest option for the I/O device, but may
cause the processor to run more slowly because the processor
may have to wait until a full block of data is transferred.
The controller can use memory cycles which are not otherwise
used in the particular bank of memory into which the DMA
controller is writing data. This approach, called cycle stealing,
is perhaps the most commonly used approach. (In a processor
with a cache that has a high hit rate this approach may not slow
the I/O transfer significantly).
DMA is a sensible approach for devices which have the capability of
transferring blocks of data at a very high data rate, in short bursts. It
is not worthwhile for slow devices, or for devices which do not provide
the processor with large quantities of data. Because the controller for
a DMA device is quite sophisticated, the DMA devices themselves
are usually quite sophisticated (and expensive) compared to other
types of I/O devices.
215
[Figure: daisy-chained arbitration: the grant signal from the bus
master passes through device 1, device 2, ..., device n in series, and
the devices share a common request line]
217
[Figure: independent request/grant arbitration: each of device 1 to
device n has its own request and grant lines connected to a central
bus arbiter]
218
Distributed arbitration by self-selection: here, the devices themselves determine which of them has the highest priority. Each
device has a bus request line or lines on which it places a code
identifying itself. Each device examines the codes for all the requesting devices, and determines whether or not it is the highest
priority requesting device.
These arbitration schemes may also be used in conjunction with each
other. For example, a set of similar devices may be daisy chained
together, and this set may be an input to a priority encoded scheme.
There is one other arbitration scheme for serial buses: distributed
arbitration by collision detection. This is the method used by
Ethernet, and it will be discussed later.
219
Some problems can arise with memory-mapped I/O in systems which
use cache memory or virtual memory. If a processor uses virtual
memory mapping, and the I/O ports are allowed to be in a virtual
address space, the mapping to the physical device may not be consistent if there is a context switch, or even if a page is replaced.
If physical addressing is used, mapping across page boundaries may
be problematic.
In many operating systems, I/O devices are directly addressable only
by the operating system, and are assigned to physical memory locations which are not mapped by the virtual memory system.
If the memory locations corresponding to I/O devices are cached,
then the value in cache may not be consistent with the new value
loaded in memory. Generally, either there is some method for invalidating cache that may be mapped to I/O addresses, or the I/O
addresses are not cached at all. We will look at the general problem of maintaining cache in a consistent state (the cache coherency
problem) in more detail when we discuss multi-processor systems.
221
[Chart: microcontroller and DSP sales by year, 1991 to 1996, in
billions of units, broken down into 4-bit, 8-bit, 16/32-bit, and DSP
devices; 8-bit parts dominate, with values ranging from 4.9 to 11.7
billion]
The projected microcontroller sales (SIA projection) were 9.8 billion
units for 2001; 9.6 billion for 2002; 12.0 billion for 2003; 13.0 billion
for 2004; and 14 billion for 2005.
For DSP devices, the projections were 4.9 billion in 2002, 6.5 billion
in 2003, 8.4 billion in 2004, and 9.4 billion in 2005.
223
Magnetic disks
A magnetic disk drive consists of a set of very flat disks, called platters, coated on both sides with a material which can be magnetized
or demagnetized.
The magnetic state can be read or written by small magnetic heads
located on mechanical arms which can move in and out over the
surfaces of the disks, very close to, but not actually touching, the
surfaces.
224
[Figure: disk structure: platters, concentric tracks on each platter
surface, and sectors within each track]
Total storage is
(no. of platters) × (no. of tracks/platter) × (no. of sectors/track) × (no. of bytes/sector)
Typically, disks are formatted and bad sectors are noted in a table
stored in the controller.
225
Disks spin at speeds of 4200 RPM to 15,000 RPM. Typical speeds for
PC desktops are 7200 RPM and 10,000 RPM. Laptop disks usually
spin at 4200 or 5400 RPM.
Disk speed is usually characterized by several parameters:
average seek time, the average time required for the read/write
head to be positioned over the correct track; typically about 8 ms.
rotational latency, the average time for the appropriate sector
to rotate to a point under the head (4.17 ms for a 7200 RPM
disk), and
transfer rate; this is about 5 Mbytes/s for an early IDE drive
(33 to 133 MB/s for an ATA drive, and 150 to 300 MB/s for a
SATA drive). Typically, sustained rates are less than half the
maximum rates.
controller overhead also contributes some delay; typically 1 ms.
Assuming a sustained data transfer rate of 50 MB/s, the time required
to transfer a 1 Kbyte block is
8 ms + 4.17 ms + 0.02 ms + 1 ms = 13.2 ms
To transfer a 1 Mbyte block in the same system, the time required is
8 ms + 4.17 ms + 20 ms + 1 ms = 33.2 ms
Note that for small blocks, most of the time is spent finding the data
to be transferred. This time is the latency of the disk.
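The same arithmetic, as a quick C check (values taken from the
example above):

#include <stdio.h>

int main(void)
{
    double seek = 8.0, latency = 4.17, overhead = 1.0;  /* all in ms */
    double rate = 50000.0;               /* 50 MB/s = 50,000 bytes per ms */

    printf("1 Kbyte: %.2f ms\n", seek + latency + 1.0e3/rate + overhead);
    printf("1 Mbyte: %.2f ms\n", seek + latency + 1.0e6/rate + overhead);
    return 0;   /* prints about 13.19 ms and 33.17 ms */
}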
226
227
[Figures: data bits (1, 0, X, 1, 1) striped across disks, with a parity
bit allowing the missing bit X to be reconstructed]
229
Failure tolerance
RAID levels above 0 provide tolerance to single disk failure. Systems
can actually rebuild a file system after the failure of a single disk.
Multiple disk failure generally results in the corruption of the whole
file system.
RAID level 0 actually makes the system more vulnerable to disk
failure: failure of a single disk can destroy the data in the whole
array.
For example, assume a disk has a failure rate of 1%. The probability
of a failure in a 2 disk system is
0.01 + (1 − 0.01) × 0.01 ≈ 0.02 = 2%
Consider a RAID 3, 4, or 5 system with 4 disks. Here, 2 disks must
fail at the same time for a system failure.
Consider a 4 disk system with the same failure rate. The probability
of a particular pair of disks failing (and the other two not failing) is
(1 − 0.01)² × (0.01)² ≈ 0.0001 = 0.01%
(counting all 6 possible pairs gives about 0.06%)
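A quick check of this failure arithmetic (per-disk failure rate of 1%):

#include <stdio.h>

int main(void)
{
    double p = 0.01;                            /* per-disk failure rate */
    double q = 1.0 - p;
    printf("2-disk array, >= 1 failure:  %.4f\n", p + q*p);     /* 0.0199 */
    printf("given pair of 4 disks fails: %.6f\n", q*q*p*p);     /* 0.000098 */
    printf("any 2 of 4 disks fail:       %.6f\n", 6*q*q*p*p);   /* 0.000588 */
    return 0;
}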
230
[Figure: host stations connected to a shared network]
231
[Figure: a switched network: host stations connected to a central
switch by Cat-5 wire]
The maximum length of a single link is 100 m, and switches are
often linked by optical fibre cable.
The following pictures show the network in the Department:
The first shows the cable plant (where all the Cat-5 wires are connected to the switches).
The second shows the switches connecting the Department to the
campus network. It is an optical fiber network operating at 10 Gbit/s.
The optical fiber cables are orange.
The third shows the actual switches used for the internal network.
Note the orange optical fibre connection to each switch.
235
Multiprocessor systems
In order to perform computation faster, there are two basic strategies:
Increase the speed of individual computations.
Increase the number of operations performed in parallel.
At present, high-end microprocessors are manufactured using aggressive technologies, so there is relatively little opportunity to take the
first strategy, beyond the general trend (Moore's law).
There are a number of ways to pursue the second strategy:
Increase the parallelism within a single processor, using multiple
parallel pipelines and fast access to memory. (e.g., the Cray
computers).
Use multiple commercial processors, each with its own memory
resources, interconnected by some network topology.
Use multiple commercial microprocessors in the same box,
sharing memory and other resources.
The first of these approaches was successful for several decades, but
the low cost per unit of commercial microprocessors is so attractive
that microprocessor-based systems have the potential to provide
very high performance computing at relatively low cost.
240
Multiprocessor systems
A multiprocessor system might look as follows:
[Figure: processors connected through an interconnection network
to memory]
242
[Figure: a bus-based shared-memory multiprocessor: each
processor's cache (with its tags) sits between a local bus and the
global bus, and the shared memory is on the global bus]
Note that here each processor has its own cache. Virtually all current
high performance microprocessors have a reasonable amount of high
speed cache implemented on chip.
In a shared memory system, this is particularly important to reduce
contention for memory access.
243
The Intel Pentium class processors use a similar cache protocol, called
the MESI protocol. Most other single-chip multiprocessors use this,
or a very similar protocol, as well.
The MESI protocol has 4 states:
modified: the cache line has been modified, and is available only
in this cache (dirty)
exclusive: no other caches have this line, and it is consistent
with memory (reserved)
shared: the line may also be in other caches, and memory is up-to-date
(clean)
invalid: the cache line is not correct (invalid)
                      M       E       S       I
Line valid?           Yes     Yes     Yes     No
Memory copy is...     stale   valid   valid   --
Copies in other
caches?               No      No      Maybe   Maybe
248
Read hit
For a read hit, the processor takes data directly from the local cache
line, as long as the line is valid. (If it is not valid, it is a cache miss,
anyway.)
Read miss
Here, there are several possibilities:
If no other cache has the line, the data is taken from memory
and marked exclusive.
If one or more caches have a clean copy of this line, (either in
states exclusive or shared) they should signal the cache, and
each cache should mark its copy as shared.
If a cache has a modified copy of this line, it signals the requesting cache
to retry, writes its copy back to memory immediately, and marks its
own copy as shared. (The requesting cache will then read the
data from memory, and mark it as shared.)
249
Write hit
The processor marks the line in cache as modified. If the line was
already in state modified or exclusive, then that cache has the only
copy of the data, and nothing else need be done. If the line was in
state shared, then the other caches should mark their copies invalid.
(A bus transaction is required).
Write miss
The processor first reads the line from memory, then writes the word
to the cache, marks the line as modified, and performs a bus transaction so that if any other cache has the line in the shared or exclusive
state it can be marked invalid.
If, on the initial read, another cache has the line in the modified
state, that cache marks its own copy invalid, suspends the initiating
read, and immediately writes its value to memory. The suspended
read resumes, getting the correct value from memory. The word can
then be written to this cache line, and marked as modified.
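The transitions just described can be condensed into a small sketch
(local events only; the bus signalling and write-backs are elided):

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* a read hit leaves the state alone; a read miss loads the line and
   marks it exclusive or shared depending on the other caches */
mesi_t on_read(mesi_t s, int line_in_other_caches)
{
    if (s != INVALID)
        return s;                                      /* read hit */
    return line_in_other_caches ? SHARED : EXCLUSIVE;  /* read miss */
}

/* any local write ends in modified; from shared this requires an
   invalidating bus transaction first, and from invalid (a write
   miss) a memory read first */
mesi_t on_write(mesi_t s)
{
    return MODIFIED;
}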
250
False sharing
One type of problem possible in multiprocessor cache systems using
a write-invalidate protocol is false sharing.
This occurs with line sizes greater than a single word, when one processor writes to a line that is stored in the cache of another processor.
Even if the processors do not share a variable, the fact that an entry
in the shared line has changed forces the caches to treat the line as
shared.
It is instructive to consider the following example (assume a line size
of 4 32 bit words, and that all caches initially contain clean, valid
data):
Step   Processor   Action   Address
1      P1          write    100
2      P2          write    104
3      P1          read     100
4      P2          read     104
Note that addresses 100 and 104 are in the same cache line (the line
is 4 words or 16 bytes, and the addresses are in bytes).
Consider the MESI protocol, and determine what happens at each
step.
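A POSIX-threads sketch of the effect: the two threads never share a
variable, only a cache line (the counts are illustrative):

#include <pthread.h>
#include <stdio.h>

struct { long a; long b; } c;   /* a and b will normally share a cache line */

static void *bump_a(void *arg) { for (long i = 0; i < 100000000; i++) c.a++; return NULL; }
static void *bump_b(void *arg) { for (long i = 0; i < 100000000; i++) c.b++; return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* padding a and b onto separate lines (say, a char pad[56] between
       them) removes the line ping-ponging and typically runs much faster */
    printf("%ld %ld\n", c.a, c.b);
    return 0;
}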
251
tmp = 0;
for (i = 0; i < n; i = i + 1)   /* n is the size of this processor's subset */
    tmp = tmp + A[Pn*n + i];
sum[Pn] = tmp;
This loop uses load instructions to bring the correct subset of numbers to the caches of each processor from the common main memory.
Each processor must have its own version of the loop counter variable
i, so it must be a private variable. Similarly for the partial sum,
tmp. The array sum[Pn] is a global array of partial sums, one from
each processor.
252
The next step is to add these many partial sums, using divide and
conquer. Half of the processors add pairs of partial sums, then a
quarter add pairs of the new partial sums, and so on until we have
the single, final sum.
In this example, the two processors must synchronize before the
consumer processor tries to read the result written to memory
by the producer processor; otherwise, the consumer may read the
old value of the data. Following is the code (half is private also):
half = 16;    /* 16 processors in multiprocessor */
repeat
    synch();                    /* wait for partial sum completion */
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];  /* if half is odd, P0 adds the odd element */
    half = half/2;              /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
253
Mutual exclusion is usually built from an atomic test-and-set (or
swap) operation on a lock variable, with 0 meaning unlocked and 1
meaning locked; a processor wanting the lock first spin waits, reading
the lock until it sees a 0.
The processor then races against all other processors that were similarly spin waiting to see who can lock the variable first. All processors
use a test-and-set instruction that reads the old value and stores a
1 (locked) into the lock variable. The single winner will see the 0
(unlocked), and the losers will see a 1 that was placed there by the
winner. (The losers will continue to write the variable with the locked
value of 1, but that doesn't change its value.) The winning processor
then executes the code that updates the shared data. When the winner exits, it stores a 0 (unlocked) into the lock variable, thereby
starting the race all over again.
The term usually used to describe the code segment between the lock
and the unlock is a critical section.
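A minimal spin-lock sketch in C, using a test-and-set primitive (here
the GCC/Clang __sync_lock_test_and_set builtin):

static volatile int lock = 0;              /* 0 = unlocked, 1 = locked */

void acquire(void)
{
    for (;;) {
        while (lock != 0)                  /* spin wait: read until unlocked */
            ;
        if (__sync_lock_test_and_set(&lock, 1) == 0)
            return;                        /* winner: saw the old value 0 */
        /* loser: saw a 1 stored by the winner; spin again */
    }
}

void release(void)                         /* end of the critical section */
{
    __sync_lock_release(&lock);            /* store 0: restart the race */
}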
256
[Flowchart: load the lock variable S; if S is not 0 (locked), spin by
loading it again; if S = 0 (unlocked), compete for the lock with an
atomic test-and-set; if the attempt fails, spin again; if it succeeds,
access the shared resource (the critical section), then unlock by
setting S = 0]
257
Let us see how this spin lock scheme works with bus-based cache
coherency.
One advantage of this scheme is that it allows processors to spin wait
on a local copy of the lock in their caches. This dramatically reduces
the amount of bus traffic. The following table shows the bus and
cache operations for multiple processors trying to lock a variable.
Once the processor with the lock stores a 0 into the lock, all other
caches see that store and invalidate their copy of the lock variable.
Then they try to get the new value of 0 for the lock. (With write
update cache coherency, the caches would update their copy rather
than first invalidate and then load from memory.) This new value
starts the race to see who can set the lock first. The winner gets the
bus and stores a 1 into the lock; the other caches replace their copy
of the lock variable containing 0 with a 1.
They read that the variable is already locked and must return to
testing and spinning.
Because of the communication traffic generated when the lock is
released, this scheme has difficulty scaling up to many processors.
258
[Table: bus and cache activity as P0 (which holds the lock) and the
spinning P1 and P2 contend: P0 releases the lock by writing 0, which
write-invalidates the spinners' cached copies; P1 and P2 miss, and the
bus services one cache miss at a time; both read lock = 0 and race to
set it; the winner's test-and-set write-invalidates the other copies; the
loser reads lock = 1 and returns to spinning, while the winner
proceeds to the shared data]
259
[Figure: a message-passing multiprocessor: processors, each with its
own memory, connected by an interconnection network]
260
The following diagrams show some useful network topologies. Typically, a topology is chosen which maps onto features of the program
or data structures.
Ring
1D mesh
2D mesh
2D torus
Tree
3D grid
Hypercube
Butterfly
262
[Figure: eight processors connected through a single switch]
263
264
The next step is to get the sum of each subset. This step is simply
a loop that every execution unit follows: read a word from local
memory and add it to a local variable:
receive(A1);    /* get this unit's subset of the data */
sum = 0;
for (i = 0; i < 1000000; i = i + 1)
    sum = sum + A1[i];
Again, the final step is adding these 16 partial sums. Now, each
partial sum is located in a different execution unit. Hence, we must
use the interconnection network to send partial sums to accumulate
the final sum.
Rather than sending all the partial sums to a single processor, which
would result in sequentially adding the partial sums, we again apply
divide and conquer. First, half of the execution units send their
partial sums to the other half of the execution units, where two partial
sums are added together. Then one quarter of the execution units
(half of the half) send this new partial sum to the other quarter of the
execution units (the remaining half of the half) for the next round of
sums.
265
limit = 16; half = 16;    /* 16 processors */
repeat
    half = (half+1)/2;              /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);       /* upper half sends its partial sum */
    if (Pn <= (limit/2-1)) {
        receive(tmp);
        sum = sum + tmp;            /* lower half receives and accumulates */
    }
    limit = half;                   /* upper limit of the senders */
until (half == 1);                  /* exit with final sum */
This code divides all processors into senders or receivers and each
receiving processor gets only one message, so we can presume that a
receiving processor will stall until it receives a message. Thus, send
and receive can be used as primitives for synchronization as well as
for communication, as the processors are aware of the transmission
of data.
266
Amdahl's law: if P is the fraction of a computation that can be
executed in parallel, and N is the number of processors, the overall
speedup is limited to
1 / ((1 − P) + P/N)
267
Gustafson's law
One of the advantages of increasing the amount of computing power available for a problem is that problems of a larger size can be attempted.
So, rather than keeping the problem size fixed, suppose we can formulate the problem to try to use parallelism to solve a larger problem
in the same amount of time. (Gustafson called this scaled speedup.)
The idea here is that, for certain problems, the serial part uses a nearly
constant amount of time, while the parallel part can scale with the
number of processors.
Assume that a problem has a serial component s and a parallel component p, so the time on the N-processor machine is s + p.
The time to complete the equivalent computation on a single processor is then s + N×p.
The speedup S(N) is
(single processor time)/(N processor time) = (s + N×p)/(s + p)
Letting α be the sequential fraction of the parallel execution time,
α = s/(s + p)
then S(N) = α + N×(1 − α)
If α is small, then S(N) ≈ N.
For problems fitting this model, the speedup is really the best one
can hope from applying N processors to a problem.
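The two formulas, side by side in C (the fractions and processor
count are arbitrary illustrative values):

#include <stdio.h>

int main(void)
{
    double P = 0.95;       /* parallel fraction of the fixed-size problem */
    double alpha = 0.05;   /* serial fraction of the parallel execution time */
    int N = 16;

    printf("Amdahl:    %.2f\n", 1.0 / ((1.0 - P) + P / N));  /* about 9.1 */
    printf("Gustafson: %.2f\n", alpha + N * (1.0 - alpha));  /* 15.25 */
    return 0;
}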
268
So, we have two models for analyzing the potential speedup for parallel computation.
They differ in the way they determine speedup.
Let us think of a simple example to show the difference between the
two:
Consider booting a computer system. It may be possible to reduce
the time required somewhat by running several processes simultaneously, but the serial nature of the process will place a lower limit on
the amount of time required. (Amdahl's Law.)
Gustafson's Law would say that, in the same time that is required
to boot the processor, more facilities could be made available; for
example, initiating more advanced window managers, or bringing up
peripheral devices.
A common explanation of the difference is:
Suppose a car is traveling between two cities 90 km apart. If the
car travels at 60 km/h for the first hour, then it can never average
100 km/h between the two cities. (Amdahl's Law.)
Suppose a car is traveling at 60 km/h for the first hour. If it
is then driven consistently at a speed greater than 100 km/h, it will
eventually average 100 km/h. (Gustafson's Law.)
269