
Advanced Topics in Computer Architecture
ECE 7373
Pauline Markenscoff
N320 Engineering Building 1
E-mail: markenscoff@uh.edu
Multiple issue processors
Single issue processors:
Eliminate data and control stalls to achieve an ideal CPI of 1.
Multiple issue processors:
Reduce CPI below 1.

Multiple issue processors:

Superscalar processors (Dynamic issue capability)


- Statically scheduled (in-order execution)
! Diminishing advantages as the issue width grows;
used primarily for narrow widths, usually for just two instructions
! Early superscalar processors
! Embedded processors
- Dynamically scheduled (out-of-order execution)
! Using techniques based on Tomasulo's Algorithm
! Most leading-edge desktops and servers
! A typical superscalar issues from 1 to 8 instructions per clock
VLIW (Very long instruction word) processors
(Static issue capability)
- Inherently statically scheduled
VLIW processors

Issue a fixed number of instructions formatted as either


- one large instruction or
- a fixed instruction packet with the parallelism among
instructions explicitly indicated by the instruction
(EPIC: Explicitly Parallel Instruction Computers)
Intel IA-64 (Itanium)
A superscalar has dynamic issue capability.
The hardware makes all decisions about multiple issue
dynamically, at run time.

A VLIW processor has static issue capability.


The compiler makes all decisions about multiple issue statically, at compile time.
Approaches to Multiple Issue

Fig. 3.15
Statically Scheduled Superscalar Processors

Instructions issue in order


All pipeline hazards are checked at issue time
Pipeline control logic must check for hazards
- among the instructions being issued in a given clock cycle,
and
- between the issuing instructions and all those still in execution.

Statically Scheduled Superscalar Processors


If some instruction in the instruction stream

is dependent (i.e., will cause a data hazard)
or
does not meet the issue criteria (structural hazard)
then only the instructions preceding it will be issued.

The pipeline receives from the instruction fetch unit between one and
k instructions, where k is the width of the issue packet.
Issue packet
- The set of instructions that could potentially issue.
If an instruction would cause a structural hazard or data hazard
- either due to an earlier instruction already in execution
- or earlier in the issue packet
then the instruction is not issued.
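
A minimal sketch of this check, written as Python pseudocode rather than hardware: it truncates an issue packet at the first instruction with a RAW dependence on an earlier instruction, either one still in execution or one earlier in the packet. The Instr fields and the single hazard rule are illustrative assumptions; structural checks would be added in the same way.

# Sketch (not any real machine's issue logic): how many leading instructions
# of an issue packet may issue this cycle.
from dataclasses import dataclass

@dataclass
class Instr:
    op: str            # opcode, e.g. "L.D" or "ADD.D"
    dst: str = ""      # destination register ("" for stores and branches)
    srcs: tuple = ()   # source registers

def issuable_count(packet, in_flight_dsts=()):
    """in_flight_dsts: destinations of instructions still in execution."""
    busy = set(in_flight_dsts)
    count = 0
    for instr in packet:
        if any(src in busy for src in instr.srcs):
            break                      # RAW hazard: stop issuing here
        if instr.dst:
            busy.add(instr.dst)        # later packet members must respect it
        count += 1
    return count

# Example: the ADD.D depends on the L.D in the same packet, so only
# the L.D issues this cycle.
packet = [Instr("L.D", "F0", ("R1",)), Instr("ADD.D", "F4", ("F0", "F2"))]
print(issuable_count(packet))          # -> 1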

Issue checks are complex


Performing them in one clock cycle would mean that the
issue logic determines the minimum clock cycle time.

In many statically and all dynamically scheduled superscalars
the issue stage is split and pipelined, so that instructions can issue every
clock cycle.
A Statically Scheduled Superscalar MIPS Processor
Assume
Two instructions can be issued per clock cycle
One of the instructions can be an integer operation
- a load, store, move (Integer or FP)
- branch or integer ALU
The other can be any FP operation.
Issue of an integer operation in parallel with an FP operation is much
simpler and less demanding than arbitrary dual issue.

Integer and FP operations use different register sets and different
functional units.
Most hazard possibilities within the issue packet are eliminated
- Sufficient in many cases to look only at the opcodes of the
instructions
The need for additional hardware is minimized.
Only difficulties when the integer instruction is an FP load, store, or move,
i.e.,
If the first instruction is an FP load and the second an FP operation, or
If the first instruction is an FP operation and the second an FP store
Then possibility of
RAW hazard
- When the second instruction of the pair depends on the first
Structural hazard
- Contention for the FP register ports
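
As an illustration of how far opcodes alone can go, the following sketch flags exactly the two problem pairings above, where the register fields (and the FP register ports) must also be examined. The opcode sets and the helper name are assumptions for illustration, not the issue logic of any real MIPS implementation.

# Opcode-only pairing check for the dual-issue MIPS described above
# (one integer slot, one FP slot); opcode sets are illustrative only.
FP_LOADS  = {"L.S", "L.D"}            # FP loads travel down the integer slot
FP_STORES = {"S.S", "S.D"}            # FP stores travel down the integer slot
FP_OPS    = {"ADD.D", "SUB.D", "MUL.D", "DIV.D"}

def pairing_needs_full_check(int_slot_op, fp_slot_op):
    """True when the pair may create a RAW hazard or contend for FP register
    ports, so the register fields (not just opcodes) must be compared."""
    if int_slot_op in FP_LOADS and fp_slot_op in FP_OPS:
        return True    # FP load feeding an FP operation in the same packet
    if int_slot_op in FP_STORES and fp_slot_op in FP_OPS:
        return True    # FP operation paired with an FP store of its result
    return False

print(pairing_needs_full_check("L.D", "ADD.D"))     # True: check registers too
print(pairing_needs_full_check("DADDIU", "ADD.D"))  # False: opcodes suffice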

Allowing FP loads and stores to issue with FP operations
creates the need for an additional read/write port on the FP register file
increases the need for bypass paths
- to avoid RAW hazards
But
Highly desirable capability for performance reasons.
There is also the possibility of WAR and WAW hazards across
issue packet boundaries.

The use of the restriction that one instruction is integer and the other one
is FP
represents a structural hazard but
reduces complexity of hazard detection
It is common in multiple-issue processors.
Issuing two instructions per cycle will require
fetching and decoding 64 bits of instructions.

Early superscalars often limited the placement of the instruction types
Integer instruction must be first
Modern superscalars dropped this restriction.
Assuming instruction placement is not limited

Steps in fetch and issue:

- Fetch two instructions from the cache


- Determine whether zero, one or two instructions can issue
- Issue them to the correct functional unit
Superscalar pipeline

Assume
All FP ops are adds (3 execution clock cycles)
Integer instruction is always shown first, although it may be the second
instruction in the issue packet.
The rate at which instructions can be issued has been substantially boosted.
To improve the rate at which instructions are executed
Pipelined FP units
Multiple independent FP units.
Complication
Maintaining precise exception model

Possibility of an imprecise exception:

An FP instruction can finish execution after an integer instruction that is
later in the program
The FP instruction exception could be detected after the integer
instruction completed.
Need to
restore a precise exception state before resuming execution, or
delay instruction completion until we know an exception is
impossible.
Maintaining peak throughput (CPI=0.5) for a dual-issue
pipeline is much harder than for a single-issue pipeline.
Single-issue pipeline
Loads had a latency of one clock cycle, which prevented one instruction from
using the result of the load without stalling.
Dual-issue pipeline
The result of a load cannot be used on the same clock cycle or the next clock
cycle; hence the next two or three instructions cannot use the result of the
load without stalling, depending on whether the load is the first or second
instruction in the pair.
Similarly, the branch delay for a taken branch becomes either two or three
instructions, depending on whether the branch is the first or the second
instruction of a pair.
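
The counting behind the "two or three instructions" can be made explicit with a small sketch; the slot numbering (0 for the first instruction of a pair, 1 for the second) is an assumption used only to make the arithmetic concrete.

# Sketch: translate a delay in clock cycles into blocked instruction slots
# for a two-wide pipeline (2 instructions per cycle).
ISSUE_WIDTH = 2

def blocked_instructions(delay_cycles, slot_in_pair):
    """Instructions after the producer that cannot use its result.
    delay_cycles: 1 for the load delay and for the taken-branch delay here.
    slot_in_pair: 0 if the producer is first in its pair, 1 if second."""
    # Remaining slots in the producer's own packet, plus one whole packet
    # per delay cycle.
    return (ISSUE_WIDTH - 1 - slot_in_pair) + delay_cycles * ISSUE_WIDTH

print(blocked_instructions(1, 0))   # load (or branch) first in pair  -> 3
print(blocked_instructions(1, 1))   # load (or branch) second in pair -> 2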
Conclusion
To effectively exploit parallelism available in a superscalar
processor we need
- more ambitious compiler scheduling techniques or
- hardware scheduling techniques.
Because of the diminishing advantages of a statically
scheduled superscalar as the issue width grows, most
designers choose to implement either

a VLIW or
a dynamically scheduled superscalar.
Multiple Instruction Issue with Dynamic Scheduling

Dynamic scheduling can increase performance

Even in the presence of hazards


Allows the processor to eliminate the issue restrictions until the
hardware runs out of reservation stations.
Extend Tomasulo's Algorithm
to support a dual-issue superscalar pipeline

Instructions are issued to the reservation stations in order
(otherwise program semantics would be violated).
How is branch prediction integrated into a
dynamically scheduled pipeline?

- Instructions are fetched and issued based on branch
predictions, but are executed only after the branch has
completed (IBM 360/91):
! Static branch prediction scheme.
- Instructions are executed based on branch predictions.
! Speculation

Multiple issue with Speculation

- Process multiple instructions per clock, assigning
reservation stations and reorder buffer entries to the
instructions.
- Must be able to handle multiple commits per clock
cycle.
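
A hedged bookkeeping sketch of these two requirements, not the actual Tomasulo data path: each instruction issued in order needs a free reservation station for its functional unit and a free reorder-buffer (ROB) entry, and up to two finished instructions commit in order from the ROB head each clock. Unit names, buffer sizes, and the Op fields are illustrative assumptions; the release of reservation stations during execution is not modeled.

# Minimal sketch of dual issue with speculation (resource bookkeeping only).
from collections import deque
from dataclasses import dataclass

ISSUE_WIDTH = COMMIT_WIDTH = 2
ROB_SIZE = 8

@dataclass
class Op:
    unit: str           # functional unit, e.g. "addr", "alu", "branch"
    done: bool = False  # set by execute/write-back stages (not modeled)

def issue(packet, free_rs, rob):
    """Issue up to ISSUE_WIDTH instructions in order; stop at the first
    one lacking a reservation station or a ROB entry."""
    issued = 0
    for op in packet[:ISSUE_WIDTH]:
        if free_rs.get(op.unit, 0) == 0 or len(rob) == ROB_SIZE:
            break                      # structural limit: stop issuing
        free_rs[op.unit] -= 1          # reservation station allocated
        rob.append(op)                 # ROB entry allocated in program order
        issued += 1
    return issued

def commit(rob):
    """Commit up to COMMIT_WIDTH finished instructions from the ROB head."""
    committed = 0
    while committed < COMMIT_WIDTH and rob and rob[0].done:
        rob.popleft()
        committed += 1
    return committed

# Example: a load/add pair issues only if both resources are available.
rob = deque()
free_rs = {"addr": 2, "alu": 2, "branch": 1}
print(issue([Op("addr"), Op("alu")], free_rs, rob))   # -> 2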
Consider the following loop:

Loop: LD R2, 0(R1) ; R2 = array element
DADDIU R2, R2, #1 ; increment array element
SD R2, 0(R1) ; store result
DADDIU R1, R1, #8 ; increment pointer
BNE R2, R3, LOOP ; branch if not last element

Assume separate integer functional units for

Effective address calculation


ALU operations
Branch condition evaluation
Assume
A dual issue dynamically scheduled processor
- Without and with speculation
Up to two instructions of any type can commit per clock.
Time of Issue, Execution, and Writing result for a
Dual-issue version of our pipeline without speculation

Fig. 3.19

Instructions following a branch cannot start execution until after the
branch condition has been evaluated.
For 3 iterations:
Issue rate: 14 instructions in 8 clock cycles = 14/8 = 1.75
Execution rate: 15 instructions in 19 clock cycles = 15/19 = 0.79
Time of Issue, Execution, and Writing result for a
Dual-issue version of our pipeline without speculation

Fig. 3.19

For 3 iterations:
Issue rate = 1.75 (14 instructions in 8 clock cycles = 14/8 = 1.75)
Execution rate = 0.79 (15 instructions in 19 clock cycles = 15/19 = 0.79)
Because the completion rate falls behind the issue rate rapidly, the
nonspeculative processor will stall when a few more iterations are issued!
Performance of the nonspeculative processor can be
improved by allowing memory access instructions to
complete effective address calculation before a
branch is decided.

Improvement will be small, unless speculative


memory accesses are allowed.
Time of Issue, Execution, and Writing result for a
Dual-issue version of our pipeline with speculation

Fig. 3.20

Instructions following a branch can start execution before the branch
condition has been evaluated.
Issue rate: 14 instructions in 8 clock cycles = 14/8 = 1.75
Execution rate: 15 instructions in 13 clock cycles = 15/13 = 1.15
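
The quoted rates for both versions, recomputed from the instruction and cycle counts taken from the figures:

def rate(instructions, cycles):
    return instructions / cycles

print(round(rate(14, 8), 2))    # issue rate, both versions: 1.75
print(round(rate(15, 19), 2))   # execution rate without speculation: 0.79
print(round(rate(15, 13), 2))   # execution rate with speculation: 1.15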
The VLIW Approach
Major factor limiting wider-issue superscalar processors:
Growth in overhead
VLIW processors
Issue a fixed number of instructions formatted as either
- one large instruction or
- a fixed instruction packet with the parallelism among instructions
explicitly indicated by the instruction
Use multiple independent functional units
- To keep functional units busy there must be enough parallelism in a code
sequence to fill the available operation slots.

The VLIW Approach

Parallelism is uncovered by the compiler by unrolling loops


and scheduling the code.

Local scheduling techniques


- If unrolling generates straight line code
Global scheduling techniques
- Scheduling code across branches.

Consider again the loop that


Increments a vector of values by a scalar stored in F2;
Starts with the element of the vector at location 0(R1), which is the highest
address, and ends at 8(R2).

Loop: L.D F0, 0(R1) ; F0=array element


ADD.D F4, F0, F2 ; add scalar in F2
S.D F4, 0(R1) ; store result
DADDUI R1, R1, #-8 ; decrement pointer
; 8 bytes (per DW)
BNE R1, R2, Loop ; branch R1!=R2
Without any scheduling
Clock cycle issued

Loop: L.D F0, 0(R1) 1


stall 2
ADD.D F4, F0, F2 3
stall 4
stall 5
S.D F4, 0(R1) 6
DADDUI R1, R1, #-8 7
stall 8
BNE R1, R2, Loop 9
stall 10
10 clock cycles per iteration
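
The 10 cycles can be recomputed from the parts of the listing above:

instructions = 5                 # L.D, ADD.D, S.D, DADDUI, BNE
stalls = 1 + 2 + 1 + 1           # load use, ADD.D -> S.D, DADDUI -> BNE, branch delay
print(instructions + stalls)     # 10 clock cycles per iteration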
The VLIW Approach
Consider a VLIW processor that can issue in every clock cycle
Two memory references
Two FP operations
One integer operation/branch
The instruction would have a set of fields for each functional
unit
16-24 bits per unit
Instruction length of between 80 and 120 bits.
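
The width range follows directly from the slot count and the field size (a quick check of the arithmetic):

slots = 2 + 2 + 1               # 2 memory, 2 FP, 1 integer/branch
for bits_per_field in (16, 24):
    print(slots * bits_per_field)   # 80 and 120 bits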

Fig. 3.16

Issue rate: 23 operations in 9 clock cycles = 2.5 operations per cycle
Execution rate: 7 results in 9 clock cycles = 0.77 results per cycle
Efficiency (percentage of available slots that contain an operation): about 60%
Loop: L.D F0, 0(R1)
stall
ADD.D F4, F0, F2
stall
stall
S.D F4, 0(R1)
DADDUI R1, R1, #-8
stall
BNE R1, R2, Loop
stall

Unroll as many times as necessary to eliminate any stalls

Fig. 3.16
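
To make the slot-filling idea concrete, here is a hedged sketch of a greedy packer: each operation of the unrolled body goes into the earliest VLIW word that has a free slot of the right kind and in which its inputs are ready. The latency numbers mirror the stall counts shown earlier (a load result usable two words later, an FP add result three words later); the operation tuples and register names are placeholders, and anti-dependences, branches, and the actual schedule of Fig. 3.16 are not modeled.

# Illustrative greedy packing of unrolled operations into VLIW words.
SLOTS = {"mem": 2, "fp": 2, "int": 1}      # slots per VLIW instruction word
LATENCY = {"mem": 2, "fp": 3, "int": 1}    # words until the result is usable

def schedule(ops):
    """ops: list of (kind, dst, srcs). Returns the list of VLIW words."""
    words, ready = [], {}                  # ready[reg] = first usable word
    for kind, dst, srcs in ops:
        t = max([ready.get(s, 0) for s in srcs], default=0)
        while True:
            while t >= len(words):
                words.append({k: [] for k in SLOTS})
            if len(words[t][kind]) < SLOTS[kind]:
                words[t][kind].append(dst or kind)   # place the operation
                if dst:
                    ready[dst] = t + LATENCY[kind]
                break
            t += 1                         # slot full: try the next word

    return words

# Two unrolled iterations (registers renamed by unrolling, names are placeholders):
ops = [("mem", "F0", ["R1"]), ("fp", "F4", ["F0", "F2"]), ("mem", "", ["F4"]),
       ("mem", "F6", ["R1"]), ("fp", "F8", ["F6", "F2"]), ("mem", "", ["F8"])]
print(len(schedule(ops)))                  # -> 6 words; both iterations share slots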

Drawbacks:
Increase in code size
- Ambitious unrolling of loops
- Instructions might not be full, and unused functional units translate into
wasted bits.
Limitations of lockstep operation
- A stall in any functional unit must cause the entire processor to stall.
