Superscalar ILP techniques

ADVANCED COMPUTER ARCHITECTURE
BY DR. RADWA M. TAWFEEK
SUPERSCALAR PROCESSORS
LAST LECTURE
• Cache Memory
• Block Location
• Replacement Strategy
• Write Strategy
THIS LECTURE
• ILP
• Superscalar Processor
THE METHOD FOR EXPLOITING PARALLELISM
• The key to higher performance in microprocessors for a broad range of
applications is the ability to exploit fine-grain, instruction-level
parallelism:
+ pipelining
+ superscalar implementation
+ specifying multiple independent operations per instruction
(the concept of VLIW)
+ multiple processors
PARALLEL PROCESSING
Processing instructions in parallel requires three major tasks:
1. checking dependencies between instructions to determine
which instructions can be grouped together for parallel
execution;
2. assigning instructions to the functional units on the hardware;
3. determining when instructions are initiated (in a VLIW machine,
which ones are placed together into a single word).
MAJOR CATEGORIES
VLIW – Very Long Instruction Word

EPIC – Explicitly Parallel Instruction Computing
MAJOR CATEGORIES [2]
SUPERSCALAR
DEFINITION AND CHARACTERISTICS
• Superscalar processing is the ability to initiate multiple instructions during the
same clock cycle.
• A typical superscalar processor fetches and decodes the incoming instruction
stream several instructions at a time.
• Only independent instructions can be executed in parallel without causing a wait
state.
• Superscalar architectures exploit the potential of ILP (Instruction-Level
Parallelism).
• The amount of instruction-level parallelism varies widely depending on the type
of code being executed.
WHAT IS SUPERSCALAR?
A superscalar machine executes multiple independent
instructions in parallel. Superscalar machines are pipelined as well.
• “Common” instructions (arithmetic, load/store, conditional branch) can be
executed independently.
• Equally applicable to RISC & CISC, but more straightforward in RISC
machines.
• The order of execution is usually assisted by the compiler.
PIPELINING IN SUPERSCALAR
PROCESSORS
• To fully utilise a superscalar processor of degree m, m
instructions must be executable in parallel. This may not be
the case in every clock cycle. When it is not, some of the pipelines
stall in a wait state.
• In a superscalar processor, a simple operation should have a latency
of only one cycle, as in the base scalar processor.
SUPERSCALAR DATAPATH
Must all ALUs be identical?

SUPERSCALAR PIPELINE DIAGRAM
• Fetching and dispatching two instructions per cycle

TWO-WAY SUPERSCALAR
• A two-way superscalar processor executes two instructions on
each cycle.
• For this program, the processor has a CPI of 0.5.
• Designers commonly refer to the reciprocal of the CPI as the
instructions per cycle, or IPC.
• This processor has an IPC of 2 on this program.
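The relationship between CPI and IPC can be checked with a few lines of Python. This is only a minimal sketch: the function names and the cycle and instruction counts in the usage lines are illustrative assumptions, not values read from the slide's figure.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction: total cycles divided by instructions executed."""
    return cycles / instructions


def ipc(cycles: int, instructions: int) -> float:
    """Instructions per cycle: the reciprocal of CPI."""
    return instructions / cycles


# Assumed steady-state case for an ideal two-way superscalar:
# six instructions complete in three cycles.
print(cpi(cycles=3, instructions=6))  # 0.5
print(ipc(cycles=3, instructions=6))  # 2.0
```

Any cycle in which fewer than two instructions can be issued raises the CPI above 0.5 and lowers the IPC below 2.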
DATA DEPENDENCE AND SUPERSCALAR
Find the dependencies in this program

lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or $t3, $s5, $s6
sw $s7, 80($t3)
DATA DEPENDENCE AND SUPERSCALAR
• Executing many instructions simultaneously is difficult because of
dependencies.
• The add instruction depends on $t0, which is produced by the lw
instruction, so it cannot be issued at the same time as lw.
• The add instruction then stalls for yet another cycle so that lw can
forward $t0 to it.
• The other dependencies (sub to and through $t0, or to sw through $t3)
are handled by forwarding.
• This program requires nine cycles to issue six instructions, for a
CPI of 1.5.
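The register dependencies in the six-instruction program can also be found mechanically. The sketch below is an illustration rather than part of the lecture: the instruction encoding and the function name `find_dependencies` are assumptions made here, and the labels RAW (read after write, the true dependence), WAR (write after read) and WAW (write after write) are the standard names for the three kinds of register dependence. Memory dependences between lw and sw are ignored.

```python
# Each entry: (instruction text, destination register or None, source registers).
PROGRAM = [
    ("lw  $t0, 40($s0)",  "$t0", ["$s0"]),
    ("add $t1, $t0, $s1", "$t1", ["$t0", "$s1"]),
    ("sub $t0, $s2, $s3", "$t0", ["$s2", "$s3"]),
    ("and $t2, $s4, $t0", "$t2", ["$s4", "$t0"]),
    ("or  $t3, $s5, $s6", "$t3", ["$s5", "$s6"]),
    ("sw  $s7, 80($t3)",  None,  ["$s7", "$t3"]),
]


def find_dependencies(program):
    """Return a list of (kind, earlier_index, later_index, register)."""
    deps = []
    last_writer = {}  # register -> index of the most recent writer
    readers = {}      # register -> indices that read it since its last write
    for j, (_, dest, srcs) in enumerate(program):
        for reg in srcs:
            if reg in last_writer:
                deps.append(("RAW", last_writer[reg], j, reg))
            readers.setdefault(reg, []).append(j)
        if dest is not None:
            for i in readers.get(dest, []):
                if i != j:  # an instruction does not conflict with itself
                    deps.append(("WAR", i, j, dest))
            if dest in last_writer:
                deps.append(("WAW", last_writer[dest], j, dest))
            last_writer[dest] = j
            readers[dest] = []
    return deps


for kind, i, j, reg in find_dependencies(PROGRAM):
    print(kind, reg, ":", PROGRAM[i][0].split()[0], "->", PROGRAM[j][0].split()[0])
# RAW $t0 : lw -> add
# WAR $t0 : add -> sub
# WAW $t0 : lw -> sub
# RAW $t0 : sub -> and
# RAW $t3 : or -> sw
```

Only the RAW entries force a later instruction to wait for a value. The WAR and WAW entries arise from reuse of the name $t0.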
PIPELINE DIVERSITY
• It may seem ideal to make all the execution units in the pipeline identical.
• However, there are different types of instructions, and each needs specific
sub-computations.
• Executing all of them on the same processing-element design increases
both complexity and inefficiency.
• For parallel pipelining there is strong motivation not to unify all the
execution hardware, but instead to implement multiple different execution
units.
• This approach is called diversified pipelines.
SUPERSCALAR WITH DIFFERENT ALU
FUNCTIONS
How would the register file know which output values should be directed to which ALU?
SUPERSCALAR WITH DIFFERENT ALU
FUNCTIONS
DIVERSIFIED PIPELINE
• Advantages:
• Each pipe can be customized for a
particular instruction type, resulting in an
efficient hardware design
• Considerations:
• Number and mix of functional units
• The first generation of superscalar pipelines
integrated floating-point and non-floating-point
processing elements
SUPERSCALAR IMPLEMENTATION
• Simultaneously fetch multiple instructions

• Logic to determine true dependencies involving register values
• Mechanisms to communicate these values
• Mechanisms to initiate multiple instructions in parallel
• Resources for parallel execution of multiple instructions
• Mechanisms for committing process state in correct order
INSTRUCTION ISSUE POLICIES
Three orderings are important:
• Order in which instructions are fetched
• Order in which instructions are executed
• Order in which instructions update registers and memory values (order of completion)
Standard categories:
• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order completion

DYNAMIC PIPELINING
• In the scalar pipeline there are state registers (buffers) between the
stages.
• In a superscalar pipeline, a multientry buffer must be used to hold the
data of each instruction to be executed in parallel.
DYNAMIC PIPELINING (2)
• In the scalar pipeline, a stall prevents the data in the buffer from
flowing to the next stage.
• In a superscalar pipeline, stalling the whole buffer also stalls instructions
that do not need to be stalled (review the example shown in slide 17).
• So a more complex multientry buffer design is required.
• One enhancement is to add the capability to explicitly address each individual entry
in the buffer, and to independently control the reading and writing of each entry.
• In a superscalar pipeline, trailing instructions can then bypass a leading stalled
instruction, which causes out-of-order execution. Such superscalar pipelines are
called dynamic pipelines.
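The difference between a buffer that stalls as a whole and one with individually controlled entries can be sketched in Python. This is a toy model under stated assumptions: the `Entry` class, its `ready` flag, and both function names are hypothetical and stand in for hardware control signals.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    ready: bool  # False means this instruction must stall this cycle


def advance_rigid(buffer):
    """Rigid buffer: if any entry stalls, the whole buffer stalls."""
    if all(entry.ready for entry in buffer):
        return [entry.name for entry in buffer]
    return []


def advance_dynamic(buffer):
    """Individually addressable entries: every ready entry moves on by itself,
    so trailing instructions can bypass a stalled leading instruction."""
    return [entry.name for entry in buffer if entry.ready]


buffer = [Entry("I1", False), Entry("I2", True), Entry("I3", True)]
print(advance_rigid(buffer))    # []
print(advance_dynamic(buffer))  # ['I2', 'I3']
```

In the dynamic case I2 and I3 leave the buffer ahead of I1, which is exactly the out-of-order execution the slide describes.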
DYNAMIC PIPELINING (3)
SUPERSCALAR EXECUTION
COMMITTING OR RETIRING INSTRUCTIONS
Results need to be put into order (commit or retire)
• Results sometimes must be held in temporary storage until it is
certain they can be placed in “permanent” storage
(results are then either committed, i.e. retired, or flushed).
• Temporary storage requires regular clean-up, an overhead that is handled
in hardware.
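One common form of such temporary storage is a reorder buffer. The slides do not name a specific structure, so the sketch below is an assumption chosen for illustration, and the class and method names are hypothetical. Results may complete in any order, but they leave the buffer for "permanent" storage strictly in program order.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self._entries = deque()  # program order, oldest on the left
        self.committed = []      # stands in for "permanent" storage

    def allocate(self, name):
        """Reserve a slot at issue time, in program order."""
        self._entries.append({"name": name, "done": False, "value": None})

    def complete(self, name, value):
        """Record a result; completion may happen in any order."""
        for entry in self._entries:
            if entry["name"] == name:
                entry["done"] = True
                entry["value"] = value
                return
        raise KeyError(name)

    def commit(self):
        """Retire finished instructions from the head only, preserving order."""
        retired = []
        while self._entries and self._entries[0]["done"]:
            entry = self._entries.popleft()
            self.committed.append((entry["name"], entry["value"]))
            retired.append(entry["name"])
        return retired


rob = ReorderBuffer()
for name in ("I1", "I2", "I3"):
    rob.allocate(name)
rob.complete("I2", 20)   # I2 finishes early
print(rob.commit())      # []  (I1 is still pending, so I2 must wait)
rob.complete("I1", 10)
print(rob.commit())      # ['I1', 'I2']
```

Flushing (for example after a mispredicted branch) would discard the uncommitted entries instead of retiring them. That path is left out to keep the sketch short.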
EXAMPLE
• A superscalar pipeline capable of fetching and decoding 2 instructions at a time
‣ Instructions are fetched in pairs. The next two instructions must wait until the pair of decode pipeline
stages has cleared.
• 3 separate functional units (e.g., two integer arithmetic and one floating-point
arithmetic)
• 2 instances of the write-back pipeline stage
• A 6-instruction code fragment with the following constraints:
‣ I1 requires two cycles to execute
‣ I3 and I4 conflict for the same functional unit (e.g., both need floating-point arithmetic)
‣ I5 depends on the value produced by I4
‣ I5 and I6 conflict for a functional unit
• When there is a conflict for a functional unit, or when a functional unit requires more than
one cycle to generate a result, instructions temporarily stall.
IN-ORDER ISSUE -- IN-ORDER COMPLETION
Issue instructions in the order they occur:
• Not very efficient
• Instructions must stall if necessary (and stalling in superscalar
is expensive)
IN-ORDER ISSUE -- IN-ORDER
COMPLETION (EXAMPLE)
Assume:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
IN-ORDER ISSUE -- OUT-OF-ORDER COMPLETION
(EXAMPLE)
Again:
OUT-OF-ORDER ISSUE -- OUT-OF-ORDER COMPLETION
• Decouple decode pipeline from execution pipeline
• Can continue to fetch and decode until the “window” is full
• When a functional unit becomes available, an instruction can be executed
(usually as close to program order as possible)
• Since instructions have already been decoded, the processor can look ahead
and identify independent instructions that can be issued
OUT-OF-ORDER ISSUE -- OUT-OF-ORDER
COMPLETION (EXAMPLE)
Again:
Note: I5 depends upon I4, but I6 does not
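The selection step of out-of-order issue can be sketched as a function that scans the instruction window oldest first and starts whatever is ready. This is a simplified illustration, not the exact logic of any processor: the function name, the tuple layout and the unit name "int1" are assumptions introduced here.

```python
def select_for_issue(window, finished, free_units):
    """Pick instructions from the window that can start this cycle.

    window     : list of (name, unit, depends_on) in program order
    finished   : set of instruction names whose results are available
    free_units : set of functional-unit names that are idle
    """
    issued = []
    available = set(free_units)
    for name, unit, depends_on in window:  # oldest first
        operands_ready = all(dep in finished for dep in depends_on)
        if operands_ready and unit in available:
            issued.append(name)
            available.remove(unit)         # one instruction per unit per cycle
    return issued


# I5 needs the result of I4; I5 and I6 compete for the same unit.
window = [("I5", "int1", ["I4"]), ("I6", "int1", [])]
print(select_for_issue(window, finished=set(), free_units={"int1"}))   # ['I6']
print(select_for_issue(window, finished={"I4"}, free_units={"int1"}))  # ['I5']
```

While I4 is still executing, I6 is issued ahead of I5. Once I4 has finished, the older instruction I5 wins the shared unit.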
SOME ARCHITECTURES
• PowerPC 604
• six independent execution units:
• Branch execution unit
• Load/Store unit
• 3 Integer units
• Floating-point unit
• in-order issue
• register renaming
• PowerPC 620
• adds out-of-order issue to the features of the 604
• Pentium
• three independent execution units:
• 2 Integer units
• Floating point unit
• in-order issue
ASSIGNMENT 2
Research and write a report on 5 real superscalar
processors (other than those mentioned in the slides). Compare
them in terms of the number of processing elements and their
diversification, and the order in which instructions are
issued and executed.
