
ADVANCED COMPUTER ARCHITECTURE
BY
DR. RADWA M. TAWFEEK
SUPERSCALAR PROCESSORS
LAST LECTURE

• Cache Memory
• Block Location
• Replacement Strategy
• Write Strategy
THIS LECTURE

• ILP
• Superscalar Processor
THE METHOD FOR EXPLOITING PARALLELISM

• The key to higher performance in microprocessors for a broad range of applications is the ability to exploit fine-grain, instruction-level parallelism:
+ pipelining
+ superscalar implementation
+ specifying multiple independent operations per instruction (the concept of VLIW)
+ multiple processors
PARALLEL PROCESSING

Processing instructions in parallel requires three major tasks:

1. checking dependencies between instructions to determine which instructions can be grouped together for parallel execution;
2. assigning instructions to the functional units on the hardware;
3. determining when instructions are initiated (or placed together into a single word).
MAJOR CATEGORIES

VLIW – Very Long Instruction Word


EPIC – Explicitly Parallel Instruction Computing
MAJOR CATEGORIES [2]
SUPERSCALAR
DEFINITION AND CHARACTERISTICS

• Superscalar processing is the ability to initiate multiple instructions during the same clock cycle.
• A typical superscalar processor fetches and decodes the incoming instruction stream several instructions at a time.
• Only independent instructions can be executed in parallel without causing a wait state.
• Superscalar architectures exploit the potential of ILP (instruction-level parallelism).
• The amount of instruction-level parallelism varies widely depending on the type of code being executed.
WHAT IS SUPERSCALAR?

A superscalar machine executes multiple independent instructions in parallel. Such machines are pipelined as well.

• “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.
• Equally applicable to RISC & CISC, but more straightforward in RISC machines.
• The order of execution is usually assisted by the compiler.
PIPELINING IN SUPERSCALAR PROCESSORS
• In order to fully utilise a superscalar processor of degree m, m
instructions must be executable in parallel. This situation may not be
true in all clock cycles. In that case, some of the pipelines may be
stalling in a wait state.
• In a superscalar processor, the simple operation latency should
require only one cycle, as in the base scalar processor.
SUPERSCALAR DATAPATH

Must all ALUs be identical?


SUPERSCALAR PIPELINE DIAGRAM

• Fetching and dispatching two instructions per cycle


TWO-WAY SUPERSCALAR

• A two-way superscalar processor executes two instructions on each cycle.
• For this program, the processor has a CPI of 0.5.
• Designers commonly refer to the reciprocal of the CPI as the instructions per cycle, or IPC.
• This processor has an IPC of 2 on this program.
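The relationship between CPI and IPC can be checked with a few lines of Python. This is only a sketch: the helper names `cpi` and `ipc` are made up for illustration, and the figures of six instructions issued over three cycles are an assumption chosen to match the CPI of 0.5 quoted above.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction: cycles spent divided by instructions issued."""
    return cycles / instructions


def ipc(cycles: int, instructions: int) -> float:
    """Instructions per cycle: the reciprocal of the CPI."""
    return instructions / cycles


# Assumed workload: six independent instructions, two issued per cycle,
# so issuing takes three cycles on an ideal two-way superscalar processor.
print(cpi(cycles=3, instructions=6))  # 0.5
print(ipc(cycles=3, instructions=6))  # 2.0
```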
DATA DEPENDENCE AND SUPERSCALAR

Find the dependencies in this program


lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or $t3, $s5, $s6
sw $s7, 80($t3)
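One way to approach this exercise mechanically is sketched below in Python. It scans the program in order, remembers the last writer and the recent readers of every register, and reports read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) dependencies. The function names and the simplified parser are assumptions for illustration; only register dependencies are considered, not dependencies through memory.

```python
import re

PROGRAM = [
    "lw  $t0, 40($s0)",
    "add $t1, $t0, $s1",
    "sub $t0, $s2, $s3",
    "and $t2, $s4, $t0",
    "or  $t3, $s5, $s6",
    "sw  $s7, 80($t3)",
]


def parse(line):
    """Return (destination register or None, set of source registers)."""
    opcode = line.split()[0]
    regs = re.findall(r"\$\w+", line)
    if opcode == "sw":          # a store only reads its registers
        return None, set(regs)
    return regs[0], set(regs[1:])


def find_dependencies(program):
    """List (kind, earlier, later, register) with 1-based instruction numbers."""
    last_writer = {}   # register -> number of the latest instruction writing it
    readers = {}       # register -> instructions that read it since that write
    deps = []
    for j, line in enumerate(program, start=1):
        dst, srcs = parse(line)
        for reg in sorted(srcs):
            if reg in last_writer:
                deps.append(("RAW", last_writer[reg], j, reg))
            readers.setdefault(reg, []).append(j)
        if dst is not None:
            for i in readers.get(dst, []):
                if i != j:
                    deps.append(("WAR", i, j, dst))
            if dst in last_writer:
                deps.append(("WAW", last_writer[dst], j, dst))
            last_writer[dst] = j
            readers[dst] = []
    return deps


for dep in find_dependencies(PROGRAM):
    print(dep)
```

Only the RAW entries are true data dependencies; the WAR and WAW entries on $t0 arise from reusing the register name.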
DATA DEPENDENCE AND SUPERSCALAR

• Executing many instructions simultaneously is difficult because of dependencies.
• The add instruction depends on $t0, which is produced by the lw instruction, so it cannot be issued at the same time as lw.
• The add instruction then stalls for yet another cycle while it waits for the value loaded by lw.
• The other dependencies (sub to and on $t0, or to sw on $t3) are handled by forwarding.
• This program requires nine cycles to issue six instructions, for a CPI of 1.5.
PIPELINE DIVERSITY

• It might seem ideal to make all the execution units in the pipeline identical.
• However, there are different types of instructions, and each of them needs specific subcomputations.
• Executing all of them on the same processing element design increases both complexity and inefficiency.
• For parallel pipelining, there is therefore strong motivation not to unify all the execution hardware, but instead to implement multiple different execution units.
• Such a design is called a diversified pipeline.
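The idea of routing each instruction type to its own kind of execution unit can be illustrated with a small Python sketch. The opcode table and the unit names are hypothetical and do not describe any specific processor; a real dispatcher works on decoded instruction fields, not opcode strings.

```python
# Hypothetical mapping from opcode to execution-unit type (illustrative only).
UNIT_FOR_OPCODE = {
    "add": "integer", "sub": "integer", "and": "integer", "or": "integer",
    "lw": "load/store", "sw": "load/store",
    "beq": "branch",
    "add.s": "floating-point", "mul.s": "floating-point",
}


def dispatch(opcodes):
    """Group opcodes by the type of execution unit that would handle them."""
    by_unit = {}
    for op in opcodes:
        by_unit.setdefault(UNIT_FOR_OPCODE[op], []).append(op)
    return by_unit


print(dispatch(["lw", "add", "add.s", "sw", "beq"]))
```

Each unit only has to implement the subcomputations of its own instruction class, which is the motivation for diversification given above.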
SUPERSCALAR WITH DIFFERENT ALU FUNCTIONS

How would the register file know which output values would be directed to which ALU?
DIVERSIFIED PIPELINE

• Advantages:
• Each pipe can be customized for a particular instruction type, resulting in an efficient hardware design

• Considerations:
• Number and mix of functional units

• The first generation of superscalar pipelines integrated floating-point and non-floating-point processing elements
SUPERSCALAR IMPLEMENTATION

• Simultaneously fetch multiple instructions


• Logic to determine true dependencies involving register values
• Mechanisms to communicate these values
• Mechanisms to initiate multiple instructions in parallel
• Resources for parallel execution of multiple instructions
• Mechanisms for committing process state in correct order
INSTRUCTION ISSUE POLICIES

• Order in which instructions are fetched

• Order in which instructions are executed

• Order in which instructions update registers and memory values (order of completion)

Standard Categories:
• In-order issue with in-order completion

• In-order issue with out-of-order completion

• Out-of-order issue with out-of-order completion


DYNAMIC PIPELINING

• In a scalar pipeline, there are state registers (buffers) between the stages.
• In a superscalar pipeline, a multientry buffer must be used to hold the data of each instruction to be executed in parallel.
DYNAMIC PIPELINING (2)

• In a scalar pipeline, when a stall occurs, it prevents the data in the buffer from flowing to the next stage.
• In a superscalar pipeline, stalling the entire buffer would also stall instructions that do not need to be stalled (review the example shown in slide 17).
• So a more complex multientry buffer design is required.
• One enhancement is to add the capability to explicitly address each individual entry in the buffer, and to independently control the reading and writing of each entry.
• In a superscalar pipeline, trailing instructions can bypass a leading stalled instruction, which causes out-of-order execution. Such superscalar pipelines are called dynamic pipelines.
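The difference between stalling the whole buffer and controlling each entry independently can be sketched in Python. The `Entry` class and both function names are assumptions made for illustration; a real buffer also tracks operands, tags, and destination stages.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    ready: bool  # True when all operands of the instruction are available


def advance_whole_buffer(buffer):
    """Scalar-style control: nothing advances unless every entry is ready."""
    if all(entry.ready for entry in buffer):
        return list(buffer), []
    return [], list(buffer)


def advance_per_entry(buffer):
    """Per-entry control: every ready entry advances on its own."""
    advanced = [entry for entry in buffer if entry.ready]
    stalled = [entry for entry in buffer if not entry.ready]
    return advanced, stalled


buffer = [Entry("I1", ready=False), Entry("I2", ready=True)]
print(advance_whole_buffer(buffer))  # I2 is held back together with I1
print(advance_per_entry(buffer))     # I2 bypasses the stalled I1
```

With per-entry control, the trailing instruction I2 leaves the buffer before the leading instruction I1, which is exactly the out-of-order behaviour described above.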
DYNAMIC PIPELINING (3)
SUPERSCALAR EXECUTION
COMMITTING OR RETIRING INSTRUCTIONS

Results need to be put into order (commit or retire)

• Results sometimes must be held in temporary storage until it is certain they can be placed in “permanent” storage (either committed or retired/flushed).

• Temporary storage requires regular cleanup, an overhead that is handled in hardware.
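A common way to picture this temporary storage is a queue kept in program order, often called a reorder buffer. The Python sketch below assumes that structure; the slides do not name it, and the class and method names are hypothetical. Results may arrive in any order, but they leave the buffer strictly from the oldest entry onward.

```python
from collections import deque


class ReorderBuffer:
    """Holds results in program order until they can be committed."""

    def __init__(self):
        self.entries = deque()  # oldest instruction sits at the left

    def allocate(self, tag):
        """Reserve an entry when the instruction is issued."""
        self.entries.append({"tag": tag, "done": False, "value": None})

    def complete(self, tag, value):
        """Record a result; this may happen out of program order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["done"] = True
                entry["value"] = value
                return
        raise KeyError(tag)

    def commit(self):
        """Retire finished entries from the head only, so order is preserved."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            committed.append((entry["tag"], entry["value"]))
        return committed


rob = ReorderBuffer()
for tag in ("I1", "I2", "I3"):
    rob.allocate(tag)
rob.complete("I2", 20)
print(rob.commit())   # nothing commits: I1 is still outstanding
rob.complete("I1", 10)
print(rob.commit())   # I1 and I2 commit together, in program order
```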
EXAMPLE

• A superscalar pipeline capable of fetching and decoding 2 instructions at a time
‣ Instructions are fetched in pairs. The next two instructions must wait until the pair of decode pipeline stages has cleared.

• 3 separate functional units (e.g., two integer arithmetic and one floating-point arithmetic)
• 2 instances of the write-back pipeline stage
• A 6-instruction code fragment with the following constraints:
‣ I1 requires two cycles to execute
‣ I3 and I4 conflict for the same functional unit (e.g., both need floating-point arithmetic)
‣ I5 depends on the value produced by I4
‣ I5 and I6 conflict for a functional unit

• When there is a conflict for a functional unit, or when a functional unit requires more than one cycle to generate a result, instructions temporarily stall.
IN-ORDER ISSUE -- IN-ORDER COMPLETION

Issue instructions in the order they occur:

• Not very efficient

• Instructions must stall if necessary (and stalling in superscalar is expensive)
IN-ORDER ISSUE -- IN-ORDER COMPLETION (EXAMPLE)
Assume:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
IN-ORDER ISSUE -- OUT-OF-ORDER COMPLETION (EXAMPLE)

Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
OUT-OF-ORDER ISSUE -- OUT-OF-ORDER COMPLETION

• Decouple decode pipeline from execution pipeline

• Can continue to fetch and decode until the “window” is full

• When a functional unit becomes available, an instruction can be executed (usually as close to program order as possible)

• Since instructions have been decoded, processor can look ahead


OUT-OF-ORDER ISSUE -- OUT-OF-ORDER COMPLETION (EXAMPLE)

Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
Note: I5 depends upon I4, but I6 does not
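The effect of the issue policy on these six instructions can be explored with a simplified Python model. It is a sketch under stated assumptions, not a reproduction of the slide diagrams: it models only the execute stage (no fetch, decode, or write-back timing), issues at most two instructions per cycle, and uses an assumed unit assignment (A, B, C) chosen so that the four constraints hold. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    unit: str            # functional unit the instruction needs (assumed)
    latency: int = 1     # execute cycles
    deps: tuple = ()     # names of instructions whose results it needs


PROGRAM = [
    Instr("I1", "A", latency=2),
    Instr("I2", "B"),
    Instr("I3", "C"),
    Instr("I4", "C"),                  # conflicts with I3
    Instr("I5", "B", deps=("I4",)),    # needs the value produced by I4
    Instr("I6", "B"),                  # conflicts with I5
]


def schedule(program, out_of_order, width=2):
    """Return {name: issue cycle} for the simplified execute-stage model."""
    pending = list(program)
    unit_free_at = {}   # unit -> first cycle in which it is free again
    result_at = {}      # name -> first cycle in which its result is usable
    issued_at = {}
    cycle = 1
    while pending:
        issued_now = []
        for ins in pending:
            if len(issued_now) == width:
                break
            unit_ok = unit_free_at.get(ins.unit, 1) <= cycle
            deps_ok = all(result_at.get(d, float("inf")) <= cycle
                          for d in ins.deps)
            if unit_ok and deps_ok:
                unit_free_at[ins.unit] = cycle + ins.latency
                result_at[ins.name] = cycle + ins.latency
                issued_at[ins.name] = cycle
                issued_now.append(ins)
            elif not out_of_order:
                break   # in-order issue: a stalled instruction blocks the rest
        pending = [ins for ins in pending if ins not in issued_now]
        cycle += 1
    return issued_at


print(schedule(PROGRAM, out_of_order=False))
print(schedule(PROGRAM, out_of_order=True))
```

In this model the in-order policy issues I6 only after I5, while the out-of-order policy lets I6 go ahead of I4 and I5 because it depends on neither of them.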
SOME ARCHITECTURES

• PowerPC 604
• six independent execution units:
• Branch execution unit
• Load/Store unit
• 3 Integer units
• Floating-point unit
• in-order issue
• register renaming
• PowerPC 620
• adds out-of-order issue to the features of the 604
• Pentium
• three independent execution units:
• 2 Integer units
• Floating point unit
• in-order issue
ASSIGNMENT 2

Search for and write a report on 5 real superscalar processors (other than those mentioned in the slides). Compare them in terms of the number of processing elements and their diversification, and the order of issue and execution of programs.
