
Instruction Level Parallelism (ILP)

Advanced Computer Architecture


CSE 8383
Spring 2004 2/19/2004
Presented By:
Sa’ad Al-Harbi
Saeed Abu Nimeh
Outline
 What’s ILP
 ILP vs Parallel Processing
 Sequential execution vs ILP execution
 Limitations of ILP
 ILP Architectures
   Sequential Architecture
   Dependence Architecture
   Independence Architecture
 ILP Scheduling
 Open Problems
 References
What’s ILP
 Architectural technique that allows the
overlap of individual machine operations
( add, mul, load, store …)
 Multiple operations will execute in parallel
(simultaneously)
 Goal: speed up execution
 Example:
   load  R1 ← R2          add   R3 ← R3, “1”
   add   R3 ← R3, “1”     add   R4 ← R3, R2
   add   R4 ← R4, R2      store [R4] ← R0
Example: Sequential vs ILP
 Sequential execution (without ILP)
   add r1, r2 → r8   (4 cycles)
   add r3, r4 → r7   (4 cycles)
   Total of 8 cycles

 ILP execution (overlapped execution)
   add r1, r2 → r8
   add r3, r4 → r7
   Total of 5 cycles (the second add is issued one cycle after the first, so the two 4-cycle operations overlap)
ILP vs Parallel Processing
ILP:
 Overlaps individual machine operations (add, mul, load, ...) so that they execute in parallel
 Transparent to the user
 Goal: speed up execution

Parallel Processing:
 Separate processors work on separate chunks of the program (the processors are programmed to do so)
 Not transparent to the user
 Goal: speed up and quality up
ILP Challenges
 To achieve parallelism, the instructions that execute in parallel must not have dependences among them:
   H/W terminology: data hazards (RAW, WAR, WAW)
   S/W terminology: data dependences
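A minimal C-level sketch of the three hazard types (the variables and values here are illustrative, not from the slides); the hardware sees the same conflicts as register hazards between the corresponding instructions:

   #include <stdio.h>

   int main(void) {
       int x = 3, y = 4, a, b;

       a = x + y;   /* S1: writes a                                           */
       b = a * 2;   /* S2: RAW (true dependence): reads the a written by S1   */
       a = y - 1;   /* S3: WAR (anti-dependence): writes the a read by S2,    */
                    /*     and WAW (output dependence): writes the a written  */
                    /*     by S1, so S1/S2/S3 cannot simply be reordered      */
       printf("a=%d b=%d\n", a, b);
       return 0;
   }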
Dependences and Hazards
 Dependences are a property of
programs
 If two instructions are data dependent, they cannot execute simultaneously
 A dependence may result in a hazard, and a hazard causes a stall
 Data dependences may occur through
registers or memory
Types of Dependencies
 Name dependences
   Output dependence
   Anti-dependence
 Data (true) dependence
 Control dependence
 Resource dependence
Name dependences
 Output dependence
   When instructions i and j both write the same register or memory location; the ordering must be preserved so that the correct value is left in the register
   i: add r7,r4,r3
   j: div r7,r2,r8
 Anti-dependence
   When instruction j writes a register or memory location that instruction i reads
   i: add r6,r5,r4
   j: sub r5,r8,r11
Data Dependences
 An instruction j is data dependent on instruction i if either of the following holds:
   instruction i produces a result that may be used by instruction j, or
   instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i

 Example:
   LOOP: LD  F0, 0(R1)
         ADD F4, F0, F2
         SD  F4, 0(R1)
         SUB R1, R1, -8
         BNE R1, R2, LOOP
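In C terms the loop above is the familiar add-a-scalar-to-every-array-element loop (a minimal sketch; the names x, n, and s are assumptions): the load, add, and store inside each iteration form a chain of true (RAW) dependences.

   /* Each iteration loads an element, adds the scalar, and stores it back,
      so LD -> ADD -> SD is a true-dependence chain within the iteration.  */
   void add_scalar(double *x, int n, double s) {
       for (int i = 0; i < n; i++)
           x[i] = x[i] + s;
   }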
Control Dependences
 A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that instruction i is executed in correct program order.

 Example:
   if (p1) {
       S1;
   }
   if (p2) {
       S2;
   }

 Two constraints imposed by control dependences:
   1. An instruction that is control dependent on a branch cannot be moved before the branch
   2. An instruction that is not control dependent on a branch cannot be moved after the branch
Resource dependences
 An instruction is resource dependent on a previously issued instruction if it requires a hardware resource that is still being used by that instruction.
 e.g. (both operations need the same divide unit):
   div r1, r2, r3
   div r4, r2, r5
ILP Architectures
 Computer architecture: a contract (the instruction format and the interpretation of the bits that constitute an instruction) between the class of programs that are written for the architecture and the set of processor implementations of that architecture.
 In ILP architectures, the program additionally embeds information about the parallelism available between instructions and operations.
ILP Architectures
Classifications
 Sequential Architectures: the program is not
expected to convey any explicit information regarding
parallelism. (Superscalar processors)
 Dependence Architectures: the program explicitly
indicates the dependences that exist between
operations (Dataflow processors)
 Independence Architectures: the program provides
information as to which operations are independent
of one another. (VLIW processors)
Sequential architecture and
superscalar processors
 Program contains no explicit information
regarding dependencies that exist between
instructions
 Dependencies between instructions must be
determined by the hardware
 It is only necessary to determine dependencies
with sequentially preceding instructions that have
been issued but not yet completed
 The compiler may reorder instructions to facilitate the hardware’s task of extracting parallelism
Superscalar Processors
 Superscalar processors attempt to issue
multiple instructions per cycle
 However, essential dependencies are
specified by sequential ordering so
operations must be processed in sequential
order
 This proves to be a performance bottleneck
that is very expensive to overcome
Dependence architecture and
data flow processors
 The compiler (programmer) identifies the parallelism
in the program and communicates it to the hardware
(specify the dependences between operations)
 The hardware determines at run time when each operation is independent of the others and performs the scheduling
 No scanning of a sequential program is needed to determine dependences
 Objective: execute each instruction at the earliest possible time (when its input operands and a functional unit are available)
Dependence architectures
Dataflow processors
 Dataflow processors are representative of
Dependence architectures
 Execute instruction at earliest possible time subject
to availability of input operands and functional units
 Dependences are communicated by providing, with each instruction, a list of all its successor instructions
 As soon as all input operands of an instruction are
available, the hardware fetches the instruction
 The instruction is executed as soon as a functional
unit is available
 Few Dataflow processors currently exist
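A minimal C sketch of the bookkeeping this implies (the type and field names are hypothetical): an instruction becomes ready once all of its operands have arrived, and its result is then delivered to every successor on its list.

   /* Hypothetical dataflow node: ready to issue once every operand slot is
      filled; the result is then forwarded to each listed successor.        */
   typedef struct df_node {
       int operands_needed;          /* number of inputs the instruction takes */
       int operands_arrived;         /* how many have been delivered so far    */
       double operand[2];            /* the delivered input values             */
       struct df_node **succ;        /* successor instructions to notify       */
       int n_succ;
   } df_node;

   static int ready(const df_node *n) {
       return n->operands_arrived == n->operands_needed;
   }

   static void deliver(df_node *n, double value) {
       n->operand[n->operands_arrived++] = value;
       /* once ready(n) holds, the node may execute as soon as a functional
          unit is free, and deliver() is then called on each n->succ[i]     */
   }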
Dataflow strengths and
limitations
 Dataflow processors use control parallelism alone to fully utilize the functional units
 A dataflow processor is more successful than others at looking far down the execution path to find control parallelism
 When successful, this is better than speculative execution:
   Every instruction executed is useful
   The processor does not have to deal with error conditions caused by speculative operations
Independence architecture
and VLIW processors
 By knowing which operations are independent, the
hardware needs no further checking to determine
which instructions can be issued in the same cycle
 The set of independent operations is far larger than the set of dependent operations, so only a subset of the independences is specified
 The compiler may additionally specify on which
functional unit and in which cycle an operation is
executed
 The hardware needs to make no run-time decisions
VLIW processors
 Operation vs. instruction
   Operation: a unit of computation (add, load, branch; corresponds to an instruction in a sequential architecture)
   Instruction: a set of operations that are intended to be issued simultaneously
 The compiler decides which operations go into each instruction (scheduling)
 All operations that are supposed to begin at the same time are packaged into a single VLIW instruction
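A minimal C sketch of what a VLIW instruction looks like to the compiler (the three-slot mix and the field names are assumptions): a fixed set of operation slots that are all issued in the same cycle, with NOPs filling slots for which no independent operation was found.

   /* Hypothetical 3-issue VLIW instruction: one slot per functional unit.
      The compiler fills the slots with mutually independent operations;
      any slot it cannot fill is encoded as a NOP.                         */
   enum opcode { NOP, ADD, MUL, LOAD, STORE, BRANCH };

   struct operation {
       enum opcode op;
       int dest, src1, src2;          /* register numbers */
   };

   struct vliw_instruction {
       struct operation alu_slot;     /* integer ALU      */
       struct operation mem_slot;     /* load/store unit  */
       struct operation branch_slot;  /* branch unit      */
   };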
VLIW strengths
 The hardware is very simple:
   a collection of functional units (adders, multipliers, branch units, etc.) connected by a bus, plus some registers and caches
 More silicon goes to the actual processing (rather than being spent on branch prediction, for example)
 It should run fast, as the only limit is the latency of the functional units themselves
 Programming a VLIW chip is very much like writing microcode
VLIW limitations
 The need for a powerful compiler
 Increased code size arising from aggressive scheduling policies
 Larger memory bandwidth and register-file bandwidth
 Limitations due to lock-step operation
 Binary compatibility across implementations with varying numbers of functional units and latencies
Summary: ILP Architectures
 Additional info required in the program
   Sequential Architecture: none
   Dependence Architecture: specification of the dependences between operations
   Independence Architecture: minimally, a partial list of independences; a complete specification of when and where each operation is to be executed

 Typical kind of ILP processor
   Sequential: Superscalar    Dependence: Dataflow    Independence: VLIW

 Dependence analysis
   Sequential: performed by HW    Dependence: performed by the compiler    Independence: performed by the compiler

 Independence analysis
   Sequential: performed by HW    Dependence: performed by HW    Independence: performed by the compiler

 Scheduling
   Sequential: performed by HW    Dependence: performed by HW    Independence: performed by the compiler

 Role of compiler
   Sequential: rearranges the code to make the analysis and scheduling HW more successful
   Dependence: replaces some analysis HW
   Independence: replaces virtually all of the analysis and scheduling HW
ILP Scheduling
 Static scheduling boosted by parallel code optimization
   Done by the compiler
   The processor receives dependency-free code optimized for parallel execution
   Typical for VLIWs and a few pipelined processors (e.g. MIPS)

 Dynamic scheduling without static parallel code optimization
   Done by the processor
   The code is not optimized for parallel execution; the processor detects and resolves dependencies on its own
   Early ILP processors (e.g. CDC 6600, IBM 360/91)

 Dynamic scheduling boosted by static parallel code optimization
   Done by the processor in conjunction with a parallel optimizing compiler
   The processor receives code optimized for parallel execution, but it detects and resolves dependencies on its own
   Usual practice for pipelined and superscalar processors (e.g. RS6000)
ILP Scheduling: Trace
scheduling
 An optimization technique that has been
widely used for VLIW, superscalar, and
pipelined processors.
 It selects a sequence of basic blocks as a
trace and schedules the operations from the
trace together.
 Example (the trace Instr1, Instr2, Instr3 crosses Branch x; a C-level sketch follows):
   Instr1
   Instr2
   Branch x
   Instr3
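A minimal C-level sketch of the idea (the variables and the "rarely taken" profile are assumptions, not from the slides): the compiler picks the likely fall-through path as the trace and schedules Instr3 together with Instr1 and Instr2, even though it originally sat after the branch; on the rare off-trace path the speculatively computed value is simply unused.

   int traced_path(int a, int b, int rare) {
       int t1 = a + 1;      /* Instr1                                        */
       int t2 = b * 2;      /* Instr2                                        */
       int t3 = t1 + t2;    /* Instr3, hoisted above the branch because the  */
                            /* profile says the branch is almost never taken */
       if (rare)            /* Branch x: off-trace path                      */
           return t1 - t2;  /* t3 is dead here, so no compensation is needed */
       return t3;           /* on-trace result                               */
   }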
Trace Scheduling
 Extract more ILP
 Increase machine fetch bandwidth by storing logically consecutive blocks in physically contiguous cache locations (making it possible to fetch multiple basic blocks in one cycle)
 Trace scheduling can be implemented
by hardware or software
Trace Scheduling in HW
 The hardware technique makes use of the large amount of information available during dynamic execution to form traces dynamically and schedule the instructions in a trace more efficiently.
 Since dependences and memory-access addresses have already been resolved during dynamic execution, instructions in a trace can be reordered more easily and efficiently.
 Example: trace cache approach
Trace scheduling in SW
 A supplement for machines without hardware trace-scheduling support.
 Forms traces based on static profile data and schedules instructions using traditional compiler scheduling and optimization techniques.
 It faces difficulties such as code explosion and exception handling.
ILP open problems
 Pipelined scheduling: optimized scheduling of pipelined behavioral descriptions. There are two simple types of pipelining (structural and functional).
 Controller cost: most scheduling algorithms do not consider the controller cost, which depends directly on the controller style used during scheduling.
 Area constraints: resource-constrained algorithms could benefit from better interaction between scheduling and floorplanning.
 Realism:
 Scheduling realistic design descriptions that contain several special
language constructs.
 Using more realistic libraries and cost functions.
 Scheduling algorithms must also be expanded to incorporate
different target architectures.
References
 Instruction-Level Parallel Processing: History, Overview and Perspective. B. Ramakrishna Rau, Joseph A. Fisher. Journal of Supercomputing, Vol. 7, No. 1, Jan. 1993, pages 9-50.
 Limits of Control Flow on Parallelism. Monica S. Lam, Robert P. Wilson. 19th ISCA, May 1992, pages 19-21.
 Global Code Generation for Instruction-Level Parallelism: Trace Scheduling-2. Joseph A. Fisher. Technical Report, HP Labs HPL-93-43, Jun. 1993.
 VLIW at IBM Research: http://www.research.ibm.com/vliw
 Intel and HP hope to speed CPUs with VLIW technology that's riskier than RISC. Dick Pountain. http://www.byte.com/art/9604/sec8/art3.htm
 Hardware and Software Trace Scheduling: http://charlotte.ucsd.edu/users/yhu/paperlist/summary.html
 ILP open problems: http://www.ececs.uc.edu/~ddel/projects/dss/hls_paper/node9.html
 Computer Architecture: A Quantitative Approach. Hennessy & Patterson, 3rd edition, Morgan Kaufmann.