CPU organization
Register Organization
User-visible Registers
General categories based on function
– General purpose
  » Can be assigned a variety of functions
  » Ideally, they are defined orthogonally to the operations within the instructions
– Data
  » These registers only hold data
– Address
  » These registers only hold address information
  » Examples: general purpose address registers, segment pointers, stack pointers, index registers
– Condition codes
  » Visible to the user but values set by the CPU as the result of performing operations
  » Example code bits: zero, positive, overflow
  » Bit values are used as the basis for conditional jump instructions
Design trade off between general purpose and specialized registers
– General purpose registers maximize flexibility in instruction design
– Special purpose registers permit implicit specification in instructions -- reduces register field size in an instruction
– No clear “best” design approach
How many registers are enough
– More registers permit more operands to be held within the CPU -- reducing memory bandwidth requirements to some extent
– More registers cause an increase in the field sizes needed to specify registers in an instruction word (see the sketch after this list)
– Locality of reference may not support too many registers
– Most machines use 8-32 registers (does not include RISC machines with register windowing -- will get to that later!)
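To see how the register-field cost scales, a minimal Python sketch; the three-operand instruction format is an illustrative assumption, not from the slides:

    import math

    def register_field_bits(num_registers: int) -> int:
        """Bits needed in an instruction to name one register."""
        return math.ceil(math.log2(num_registers))

    # A three-operand ALU instruction (dest, src1, src2) spends
    # 3 * field bits on register specifiers alone.
    for n in (8, 16, 32, 64):
        bits = register_field_bits(n)
        print(f"{n:3d} registers -> {bits} bits/field, "
              f"{3 * bits} bits for a 3-operand instruction")

Doubling the register count costs one more bit per operand field, which is why most machines settle in the 8-32 range.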
Control and status registers
Figure 11.3 Example register organizations
Figure 11.4 Extensions to 32-bit microprocessors
Instruction Cycle
Instruction pipelining
The instruction cycle state diagram clearly shows the sequence of operations that take place in order to execute a single instruction
A “good” design goal of any system is to have all of its components performing useful work all of the time -- high efficiency
Following the instruction cycle in a sequential fashion does not permit this level of efficiency
Compare the instruction cycle to an automobile assembly line
– Perform all tasks concurrently, but on different (sequential) instructions
– The result is temporal parallelism
– Result is the instruction pipeline
An ideal pipeline divides a task into k independent sequential subtasks
– Each subtask requires 1 time unit to complete
– The task itself then requires k time units to complete
For n iterations of the task, the execution times will be:
– With no pipelining: nk time units
– With pipelining: k + (n-1) time units
Speedup of a k-stage pipeline is thus
S = nk / [k + (n-1)] ==> k (for large n)
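As a quick numeric check of the speedup formula, a minimal Python sketch:

    def pipeline_speedup(k: int, n: int) -> float:
        """Speedup of a k-stage pipeline over sequential execution
        of n tasks: S = nk / (k + (n - 1))."""
        return (n * k) / (k + (n - 1))

    # Speedup approaches k as n grows large.
    for n in (1, 9, 100, 10_000):
        print(f"k=6, n={n:6d}: S = {pipeline_speedup(6, n):.3f}")

With k = 6 and n = 9 this gives 54/14 ≈ 3.86, matching the 14-versus-54 time units of Figure 11.12 below.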
First step: instruction (pre)fetch
– Divide the instruction cycle into two (equal??) “parts”
  » I-fetch
  » Everything else (execution phase)
– While one instruction is in “execution,” overlap the prefetching of the next instruction
  » Assumes the memory bus will be idle at some point during the execution phase
  » Reduces the time to fetch an instruction to zero (ideal situation)
– Problems
  » The two parts are not equal in size
  » Branching can negate the prefetching
    As a result of the branch instruction, you have prefetched the “wrong” instruction
– Alternative approaches
  » Finer division of the instruction cycle: use a 6-stage pipeline (a schedule sketch follows the figures below)
    Instruction fetch
    Decode opcode
    Calculate operand address(es)
    Fetch operands
    Perform execution
    Write (store) result
  » Use multiple execution “functional units” to parallelize the actual execution phase of several instructions
  » Use branching strategies to minimize branch impact
Figure 11.12 Pipelined execution of 9 instructions in 14 time units vs. 54
Figure 11.13 Impact of a branch after instruction 3 (to instruction 15)
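A minimal sketch of the ideal schedule behind Figure 11.12, using the six stages listed earlier; the stage abbreviations are assumed, and no stalls or hazards are modeled:

    STAGES = ["FI", "DI", "CO", "FO", "EX", "WO"]  # the 6 stages above

    def schedule(n_instructions: int, k: int = len(STAGES)) -> int:
        """Print the ideal pipeline timetable; return total time units."""
        total = k + (n_instructions - 1)
        for t in range(1, total + 1):
            # Instruction i occupies stage index (t - i) at time unit t.
            row = [STAGES[t - i] if 0 <= t - i < k else "  "
                   for i in range(1, n_instructions + 1)]
            print(f"t={t:2d}: " + " ".join(row))
        return total

    print("total time units:", schedule(9))   # 6 + (9 - 1) = 14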
Pipeline Limitations
Branching
– For the pipeline to have the desired operational speedup, we must “feed it” with long strings of instructions
  » However, 15-20% of instructions in an assembly-level stream are (conditional) branches
  » Of these, 60-70% take the branch to a target address
  » Impact of the branch is that the pipeline never really operates at its full capacity -- limiting the performance improvement that is derived from the pipeline
– The average time to complete a pipelined instruction becomes 1 + pe·b time units, where b is the branch penalty and pe = pb·pt is the effective probability of a taken branch (pb = fraction of instructions that are branches, pt = fraction of branches taken) [Lil88]
– A number of techniques can be used to minimize the impact of the branch instruction (the branch penalty)
Figure: Loss of performance resulting from conditional branches [Lil88]
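Plugging in representative numbers from the bullets above (20% branches, 65% of them taken); the 5-stage penalty b is an illustrative assumption:

    def avg_time_per_instruction(p_b: float, p_t: float, b: float) -> float:
        """Average time units per pipelined instruction with branch
        stalls: 1 + pe * b, where pe = p_b * p_t."""
        return 1.0 + (p_b * p_t) * b

    # p_b = 0.20 and p_t = 0.65 follow the 15-20% / 60-70% figures above;
    # a branch penalty of b = 5 stages is assumed for illustration.
    print(avg_time_per_instruction(p_b=0.20, p_t=0.65, b=5))  # 1.65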
– Multiple streams
  » Replicate the initial portions of the pipeline and fetch both possible next instructions
  » Increases chance of memory contention
  » Must support multiple streams for each instruction in the pipeline
– Prefetch branch target
  » When the branch instruction is decoded, begin to fetch the branch target instruction and place it in a second prefetch buffer
  » If the branch is not taken, the sequential instructions are already in the pipe -- no loss of performance
  » If the branch is taken, the next instruction has been prefetched and results in minimal branch penalty (don’t have to incur a memory read operation at the end of the branch to fetch the instruction)
– Look ahead, look behind buffer (loop buffer)
  » Many conditional branch operations are used for loop control
  » Expand the prefetch buffer so as to buffer the last few instructions executed in addition to the ones that are waiting to be executed
  » If the buffer is big enough, an entire loop can be held in it -- reducing the branch penalty
Figure: loop buffer holding previous (already executed) and pending instructions around the PC
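A minimal sketch of the loop-buffer hit test, assuming the buffer spans a small window of instruction addresses behind and ahead of the PC; the window sizes are illustrative:

    class LoopBuffer:
        """Holds the last `behind` executed instructions plus the next
        `ahead` prefetched ones; a branch whose target falls in this
        window can be served without a memory read."""
        def __init__(self, behind: int = 4, ahead: int = 4):
            self.behind = behind
            self.ahead = ahead

        def hit(self, pc: int, target: int) -> bool:
            return pc - self.behind <= target <= pc + self.ahead

    buf = LoopBuffer()
    print(buf.hit(pc=100, target=97))  # True: short backward loop branch
    print(buf.hit(pc=100, target=60))  # False: must fetch from memory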
– Branch prediction
  » Make a good guess as to which instruction will be executed next and start that one down the pipeline
  » If the guess turns out to be right, no loss of performance in the pipeline
  » If the guess was wrong, empty the pipeline and restart with the correct instruction -- suffering the full branch penalty
  » Static guesses: make the guess without considering the runtime history of the program
    Branch never taken
    Branch always taken
    Predict based on the opcode
  » Dynamic guesses: track the history of conditional branches in the program
    Taken / not taken switch
    History table
Figure 11.16 Branch prediction using 2 history bits
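A minimal sketch of the two-history-bit scheme of Figure 11.16 as a saturating counter; the particular state encoding is an assumption:

    class TwoBitPredictor:
        """2-bit saturating counter: states 0,1 predict not-taken; 2,3
        predict taken. Two consecutive mispredictions are needed to
        flip the prediction -- one odd outcome (e.g. a loop exit)
        does not retrain the predictor."""
        def __init__(self, state: int = 1):
            self.state = state  # 0..3

        def predict(self) -> bool:
            return self.state >= 2

        def update(self, taken: bool) -> None:
            self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)

    p = TwoBitPredictor()
    for taken in [True, True, True, False, True, True]:  # typical loop branch
        guess = p.predict()
        p.update(taken)
        print(f"predicted {'T' if guess else 'N'}, actual {'T' if taken else 'N'}")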
Superscalar and Superpipelined Processors
Superscalar
– Implement the CPU such that more than one instruction can be performed (completed) at a time
– Involves replication of some or all parts of the CPU/ALU
– Examples:
  » Fetch multiple instructions at the same time
  » Decode multiple instructions at the same time
  » Perform add and multiply at the same time
  » Perform load/stores while performing ALU operation
– Degree of parallelism, and hence the speedup of the machine, goes up as more instructions are executed in parallel
Superscalar design limitations
Data dependencies: must ensure computed results are the same as would be computed on a strictly sequential machine
– Two instructions cannot be executed in parallel if the (data) output of one is the input of the other or if they both write to the same output location
– Consider (the hazards here are classified in the sketch after this list):
  S1: A = B + C
  S2: D = A + 1
  S3: B = E + F
  S4: A = E + 3
Resource dependencies
– In the above sequence of instructions, the adder unit gets a real workout!
– Parallelism is limited by the number of adders in the ALU
Instruction issue policy: in what order are instructions issued to the execution unit and in what order do they finish?
– In-order issue, in-order completion
  » Simplest method, but severely limits performance
  » Strict ordering of instructions: data and procedural dependencies or resource conflicts delay all subsequent instructions
  » “Slow” execution of some instructions delays all subsequent instructions
– In-order issue, out-of-order completion
  » Any number of instructions can be executed at a time
  » Instruction issue is still limited by resource conflicts or data and procedural dependencies
  » Output dependencies resulting from out-of-order completion must be resolved
  » “Instruction” interrupts can be tricky
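To make the dependency classes concrete, a minimal sketch that classifies the hazards in the S1-S4 sequence above; representing each instruction as a destination plus a set of sources is an assumption:

    # Each instruction: (destination, set of source operands), per S1-S4.
    prog = [("A", {"B", "C"}),   # S1: A = B + C
            ("D", {"A"}),        # S2: D = A + 1
            ("B", {"E", "F"}),   # S3: B = E + F
            ("A", {"E"})]        # S4: A = E + 3

    for i, (di, si) in enumerate(prog):
        for j in range(i + 1, len(prog)):
            dj, sj = prog[j]
            if di in sj:
                print(f"S{j+1} reads {di} written by S{i+1} (true dependence)")
            if di == dj:
                print(f"S{j+1} rewrites {di} after S{i+1} (output dependence)")
            if dj in si:
                print(f"S{j+1} overwrites {dj} read by S{i+1} (antidependence)")

Running it flags S1→S2 as a true dependence, S1→S4 as an output dependence, and S1→S3 and S2→S4 as antidependences -- exactly the cases the issue policies above must respect.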
– Out-of-order issue, out-of-order completion
  » Decode and execute stages are decoupled via an instruction buffer “window”
  » Decoded instructions are “stored” in the window awaiting execution
  » Functional units will take instructions from the window in an attempt to stay busy
    This can result in out-of-order execution
    S1: A = B + C
    S2: D = E + 1
    S3: G = E + F
    S4: H = E * 3
  » “Antidependence” class of data dependencies must be dealt with
Register renaming
– Output dependencies and antidependencies are eliminated by the use of a register “pool” as follows
  » For each instruction that writes to a register X, a “new” register X is instantiated
  » Multiple “register Xs” can co-exist
– Consider
  S1: R3 = R3 + R5
  S2: R4 = R3 + 1
  S3: R3 = R5 + 1
  S4: R7 = R3 + R4
  becomes
  S1: R3b = R3a + R5a
  S2: R4b = R3b + 1
  S3: R3c = R5a + 1
  S4: R7b = R3c + R4b
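A minimal sketch of a renaming pass that reproduces the R3a/R3b/R3c rewriting above; the suffix-letter naming mirrors the slide, while the three-operand tuple format is an assumption:

    import string

    def rename(prog):
        """prog: list of (dest, src1, src2) operand names. Every write
        to register X allocates a fresh instance (Xa, Xb, ...); reads
        refer to the most recent instance."""
        version = {}                          # register -> suffix index

        def cur(r):                           # live instance of operand r
            if not r.startswith("R"):         # immediate, not a register
                return r
            return r + string.ascii_lowercase[version.setdefault(r, 0)]

        renamed = []
        for dest, s1, s2 in prog:
            s1n, s2n = cur(s1), cur(s2)       # read before the new write
            version[dest] = version.get(dest, 0) + 1   # fresh instance
            renamed.append((cur(dest), s1n, s2n))
        return renamed

    prog = [("R3", "R3", "R5"),   # S1: R3 = R3 + R5
            ("R4", "R3", "1"),    # S2: R4 = R3 + 1
            ("R3", "R5", "1"),    # S3: R3 = R5 + 1
            ("R7", "R3", "R4")]   # S4: R7 = R3 + R4
    for i, (d, a, b) in enumerate(rename(prog), 1):
        print(f"S{i}: {d} = {a} + {b}")

The output matches the renamed sequence above; with R3b, R3c, etc. as distinct physical registers, S3 no longer conflicts with S1 or S2 and can issue out of order.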
Summary