BEC30303 Chapter 6: Pipelining Performance Analysis

Faculty of Electrical and Electronic Engineering
Semester II, Session 2016/2017
BEC30303
Computer Architecture and Organization
Chapter 6:
Pipelining
Mohamad Hairol Jabbar
Department of Computer Engineering
http://fkee.uthm.edu.my/mhjabbar
OUTLINE
1. Pipeline organization
2. Data dependencies
3. Pipeline issues
4. Branch delays
5. Superscalar operation
6. Performance evaluation
Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 2

1. PIPELINE ORGANIZATION
IMPROVE EXECUTION OF PROGRAMS
• Use faster circuit technology to build the
processor and the main memory.
• Arrange the hardware so that more than one
operation can be performed at the same time.
• In the latter way, the number of operations
performed per second is increased even
though the elapsed time needed to perform
any one operation is not changed.

LAUNDRY EXAMPLE
•Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold
•Washer takes 30 minutes
•Dryer takes 40 minutes
•Folder takes 20 minutes

SEQUENTIAL CONCEPT
[Figure: timeline from 6 PM to midnight; loads A to D in task order, each taking 30 + 40 + 20 minutes, one after another]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

PIPELINE CONCEPT
• Pipelining doesn’t help latency
of single task, it helps
throughput of entire workload
• Pipeline rate limited by slowest
pipeline stage
• Multiple tasks operating
simultaneously using different
resources
• Potential speedup = Number of
pipe stages
• Unbalanced lengths of pipe
stages reduce speedup
• Time to “fill” pipeline and time
to “drain” it reduces speedup
• Stall for Dependences
• Pipelined laundry takes 3.5 hours for 4 loads
• What is the speedup? = 6/3.5 ≈ 1.7
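
The laundry numbers above can be checked with a short Python sketch. It is only an illustration: the function names are hypothetical, and it assumes each machine handles one load at a time and that a finished load can wait for the next machine.

```python
# Stage times in minutes from the laundry example: washer, dryer, folder.
STAGE_MINUTES = [30, 40, 20]
LOADS = 4


def sequential_minutes(stages, loads):
    """Each load goes through every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """Finish time when each stage works on one load at a time, loads in order."""
    stage_free = [0] * len(stages)  # time at which each stage becomes free
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for i, duration in enumerate(stages):
            start = max(ready, stage_free[i])
            ready = start + duration
            stage_free[i] = ready
        finish = ready
    return finish


seq = sequential_minutes(STAGE_MINUTES, LOADS)
pipe = pipelined_minutes(STAGE_MINUTES, LOADS)
print(seq / 60, pipe / 60, round(seq / pipe, 2))  # 6.0 3.5 1.71
```

The speedup stays well below the 3 pipe stages because the 40-minute dryer is the slowest stage and the pipeline needs time to fill and drain.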

LATENCY VS THROUGHPUT
• Latency:…
• Throughput:…
• Speedup due to increased throughput
– Latency (time for each instruction) does not
decrease

NON-PIPELINED DESIGN
• Single-cycle implementation:
– The cycle time depends on the slowest instruction
– Every instruction takes the same amount of time
• Multi-cycle implementation:
– Divide the execution of an instruction into multiple
steps
– Each instruction may take variable number of
steps (clock cycles)

SINGLE CYCLE
• The cycle time depends on the slowest
instruction
• Every instruction takes the same amount of
time
Problem:
- Long cycle time (means what?), to finish each instruction
Source: fetweb.ju.edu.jo

MULTI CYCLE
• Divide the execution of an instruction into
multiple steps
• Each instruction may take a variable number of
steps (clock cycles)
Problem:
- Improves the clock cycle period, but some instructions
take longer to finish than in the single-cycle implementation
Source: fetweb.ju.edu.jo

PIPELINE IMPLEMENTATION
Ideal multi-cycle implementation with pipeline
Advantages:
- Once the pipeline is full, we get CPI = 1 (means what?)

SINGLE, MULTI, PIPELINE
lw: load word
sw: store word

REALISTIC PIPELINE EXECUTION
Each stage executes within the same clock period/cycle, but some stages complete early
[Figure: three consecutive stages, each taking 1 clock cycle]

CPI, IPC
• CPI:
– Cycles per instruction
– Smaller is better
– Ideal case (all steps finish in 1 clock cycle):
CPI = 1
– Realistic case: some steps take multiple cycles,
so use the average CPI
• IPC:
– Instructions per cycle
– = 1/CPI

CPI EXAMPLE
• A program executes:
– ALU, Load, Store instructions in equal proportions (one third each)
– Cycles per instruction type: ALU = 1, Load = 2,
Store = 3
• What is the CPI?
– (1/3 * 1) + (1/3 * 2) + (1/3 * 3) = 2
– Also = total no. of cycles / total no. of instructions
Source: http://www.cis.upenn.edu/
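
The weighted-average CPI in this example can be reproduced in Python. This is a minimal sketch that assumes each instruction type makes up exactly one third of the program (the slide's 33%); the names `cycles`, `mix`, and `average_cpi` are illustrative only.

```python
from fractions import Fraction

# Cycles per instruction type, from the example.
cycles = {"ALU": 1, "Load": 2, "Store": 3}
# Assumed instruction mix: exactly one third of each type.
mix = {name: Fraction(1, 3) for name in cycles}


def average_cpi(cycles, mix):
    """Weighted average: sum of (fraction of instructions * cycles for that type)."""
    return sum(mix[name] * cycles[name] for name in cycles)


cpi = average_cpi(cycles, mix)
ipc = 1 / cpi
print(cpi, ipc)  # 2 1/2
```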

PIPELINE PERFORMANCE
Example: sequence of Load instructions
• Time to start executing the fourth instruction:
– Single-cycle = 1000 x 3 = 3000 ps
– Pipelining = 3 x 200 = 600 ps (we fetch one instruction every clock cycle)
– Speedup = 3000/600 = 5 over single cycle
• Total execution time for three load instructions:
– Single-cycle = 3 x 1000 = 3000 ps, clock cycle?
– Pipelined = 1200 ps, clock cycle?
– Speedup = ???
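
The start-time comparison for the fourth instruction can be sketched as follows. It assumes one new instruction begins every clock cycle in both designs, with the 1000 ps and 200 ps cycle times taken from the example; the function name is hypothetical.

```python
SINGLE_CYCLE_PS = 1000   # clock period of the single-cycle design (from the example)
PIPELINE_CYCLE_PS = 200  # clock period of the pipelined design (from the example)


def start_time_ps(position, cycle_ps):
    """Start time of the instruction at 1-based `position`,
    assuming one new instruction starts every clock cycle."""
    return (position - 1) * cycle_ps


single = start_time_ps(4, SINGLE_CYCLE_PS)
pipelined = start_time_ps(4, PIPELINE_CYCLE_PS)
print(single, pipelined, single / pipelined)  # 3000 600 5.0
```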

2. DATA DEPENDENCIES
DATA DEPENDENCE
• Data dependence:
– one instruction is dependent on another
instruction to provide its operands.
• Control dependence (aka branch
dependences):
– one instruction determines whether another
instruction gets executed or not.
• Control dependences are particularly critical
with conditional branches.

EXAMPLE OF DATA DEPENDENCIES
• Control dependence will always
create a control hazard, why?
• Data dependence does not
necessarily create a data hazard
Source: http://cseweb.ucsd.edu/

EXAMPLE
Instruction:

3. PIPELINE ISSUES
HAZARDS
• Any condition that causes a pipeline to stall is
called a hazard.
• Pipeline stall causes degradation in pipeline
performance.

TYPES OF HAZARDS
• Data hazard – any condition in which either
the source or the destination operands of an
instruction are not available at the time
expected in the pipeline.
• Structural hazard – the situation when two
instructions require the use of a given
hardware resource at the same time.
• Instruction (control) hazard – a delay in the
availability of an instruction causes the
pipeline to stall.

DATA HAZARD
• Data/operands are not ready when they are needed
[Figure: pipeline timing diagram, clock cycles 1 to 7, showing a read-after-write hazard]
Source: www.cs.utexas.edu/

TYPES OF DATA HAZARD
• RAW (read after write)
– only hazard for ‘fixed’ pipelines
– later instruction must read after earlier write
• WAW (write after write)
– variable-length pipeline
– later instruction must write after earlier write
• WAR (write after read)
– pipelines with late read
– later instruction must write after earlier read

HANDLING DATA HAZARDS
• In software:
– insert independent instructions (or no-ops), by the
compiler
• In hardware:
– insert bubbles (i.e. stall the pipeline): solves all
hazards
– data forwarding: does not solve every hazard
(a stall is sometimes still needed)

INSERTING NOP (1/2)
• Let the compiler detect and handle the
hazard:
add $1, x, x
NOP
NOP
sub $4, $1, $5
add $6, $1, $7
...
• The compiler can reorder the instructions to
perform some useful work during the NOP
slots.
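
A compiler pass that inserts NOPs in this way might look like the sketch below. It is a simplified illustration, not a real compiler: it assumes two independent slots are needed between a write and a dependent read (as in the example above), and the source registers `$2` and `$3` of the first `add` are hypothetical stand-ins for the unspecified operands.

```python
NOP = ("nop", None, ())
REQUIRED_GAP = 2  # assumed: slots needed between a register write and a dependent read


def insert_nops(program, gap=REQUIRED_GAP):
    """program is a list of (opcode, destination, sources) tuples."""
    out = []
    for op, dest, srcs in program:
        needed = 0
        # Check the last `gap` emitted slots for an instruction producing one of our sources.
        for distance, prev in enumerate(reversed(out[-gap:]), start=1):
            if prev[1] is not None and prev[1] in srcs:
                needed = max(needed, gap - distance + 1)
        out.extend([NOP] * needed)
        out.append((op, dest, srcs))
    return out


program = [
    ("add", "$1", ("$2", "$3")),  # source registers are hypothetical
    ("sub", "$4", ("$1", "$5")),
    ("add", "$6", ("$1", "$7")),
]
scheduled = insert_nops(program)
print([instr[0] for instr in scheduled])  # ['add', 'nop', 'nop', 'sub', 'add']
```

A real compiler would first try to move independent instructions into those slots and fall back to NOPs only when none are available.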
INSERTING NOP (2/2)
[Figure: pipeline timing diagram, clock cycles 1 to 7]

STALL THE PIPELINE
What will happen to the CPI?
[Figure: pipeline timing diagram, clock cycles 1 to 7]
Source: www.cs.utexas.edu/
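
One way to reason about the question: every stall cycle is a cycle in which no instruction completes, so the average CPI rises above 1. The sketch below uses hypothetical numbers (100 instructions, 20 stall cycles) and ignores the cycles needed to fill the pipeline.

```python
def cpi_with_stalls(instructions, stall_cycles):
    """Average CPI of a pipeline that would otherwise complete one instruction per cycle."""
    return (instructions + stall_cycles) / instructions


# Hypothetical workload: 100 instructions with 20 stall cycles in total.
print(cpi_with_stalls(100, 20))  # 1.2
```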

OPERAND FORWARDING (1/2)
• Instead of reading from the register file, the second
instruction can get the data directly from the
output of the ALU once the previous instruction has
finished its ALU operation.
• A special arrangement needs to be made to
“forward” the output of ALU to the input of
ALU.

OPERAND FORWARDING (2/2)
[Figure: pipeline timing diagram, clock cycles 1 to 7]

EXAMPLE

FORWARDING HARDWARE
The result is forwarded to the ALU input, where it is needed to execute the next instruction

FORWARDING PROBLEM
[Figure: pipeline timing diagram, clock cycles 1 to 7]
Still must stall for 1 clock cycle (for instruction #2)

STRUCTURAL HAZARDS
• The situation when two instructions require the use
of a given hardware resource at the same time.
[Figure: pipeline timing diagram, clock cycles 1 to 7; structural hazard due to accessing the same registers at the same time]
Source: www.cs.utexas.edu/

FIXING STRUCTURAL HAZARDS
• Fix using:
– Stall
– Separate memories for instructions and data, or
memory accesses scheduled at different times

FIXING THE HAZARD
[Figure: pipeline timing diagram, clock cycles 1 to 7]

CONTROL HAZARD
• Whenever the stream of instructions supplied
by the instruction fetch unit is interrupted, the
pipeline stalls.
• Examples of interruption:
– Cache miss
– Branch instruction

CONTROL HAZARD
[Figure: pipeline timing diagram, clock cycles 1 to 7]

FIXING CONTROL HAZARD
[Figure: pipeline timing diagram, clock cycles 1 to 7]
Stall the pipeline (insert bubbles) for several clock cycles before the real instruction is fetched

4. BRANCH DELAYS
BRANCH TYPES
• Unconditional, example?
• Conditional, example?

CONDITIONAL BRANCHES
• A conditional branch instruction introduces
the added hazard caused by the dependency
of the branch condition on the result of a
preceding instruction.
• The decision to branch cannot be made until
the execution of that instruction has been
completed.
• Branch instructions represent about 20% of
the dynamic instruction count of most
programs.

BRANCH PENALTY
[Figure: pipeline timing diagram, clock cycles 1 to 7]
3-cycle penalty for a branch instruction
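
The cost of this penalty can be estimated with a short sketch. It combines the 3-cycle penalty with the figure of about 20% branches from the earlier slide, and assumes an ideal base CPI of 1 and that every branch pays the full penalty. The function and parameter names are hypothetical.

```python
BRANCH_FRACTION = 0.20  # from the slides: about 20% of dynamic instructions are branches
BRANCH_PENALTY = 3      # from the slides: 3-cycle penalty per branch
BASE_CPI = 1.0          # assumed: ideal pipeline with no other stalls


def effective_cpi(base_cpi, branch_fraction, penalty_cycles, penalised_fraction=1.0):
    """Average CPI when a fraction of the branches pays the penalty."""
    return base_cpi + branch_fraction * penalised_fraction * penalty_cycles


cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, BRANCH_PENALTY)
print(f"{cpi:.2f}")  # 1.60
```

Under these assumptions the branch penalty alone costs about 60% of the ideal throughput, which is why the techniques that follow try to reduce it.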


REDUCING BRANCH PENALTY
• For unconditional branch:
– Compute the branch target address earlier in the
pipeline
– Instruction prefetching
– Delayed branch
– Branch prediction

BRANCH TARGET ADDRESS
• Determine the branch target address during the decode stage: saves 1 clock cycle
• Determine the branch target address during the execute stage: 2 clock cycles of branch penalty

INSTRUCTION QUEUE/PREFETCHING
[Figure: instruction fetch unit with an instruction queue, followed by the dispatch/decode unit, execute stage, and write stage]
F: Fetch instruction
D: Dispatch/Decode unit
E: Execute instruction
W: Write results
Memory access is slow, thus prefetch the instructions to avoid a long delay (waiting) when a branch happens

DELAYED BRANCH (1/2)
• The instructions in the delay slots are always
fetched. Therefore, we would like to arrange
for them to be fully executed whether or not
the branch is taken.
• The objective is to place useful instructions in
these slots.
• The effectiveness of the delayed branch
approach depends on how often it is possible
to reorder instructions.

DELAYED BRANCH (2/2)
• Scheduling techniques:
– The compiler statically schedules an independent
instruction in the branch delay slot.
• The instruction in the branch delay slot is
executed whether or not the branch is taken

EXAMPLE
• A previous add instruction with no effect on
the branch is scheduled in the branch delay
slot
Source: http://home.deib.polimi.it/

BRANCH NOT TAKEN
• If the branch is not taken, the execution
continues with the next instruction

BRANCH TAKEN
• If the branch is taken, execution continues
at the branch target after the instruction in
the delay slot is executed

BRANCH PREDICTION (1/2)
• To predict whether or not a particular branch
will be taken.
• Simplest form:
– assume branch will not take place and continue to
fetch instructions in sequential address order.
• Until the branch is evaluated, instruction
execution along the predicted path must be
done on a speculative basis.
• Speculative execution:
– instructions are executed before the processor is
certain that they are in the correct execution
sequence.
BRANCH PREDICTION (2/2)
• Better performance can be achieved if we arrange
for some branch instructions to be predicted as
taken and others as not taken.
• Use hardware to observe whether the target
address is lower or higher than that of the branch
instruction.
• Let compiler include a branch prediction bit.
• So far the branch prediction decision is always the
same every time a given instruction is executed –
static branch prediction.
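
A static rule based on the target address can be sketched as below. The slides do not say which direction is predicted as taken; this sketch assumes the common "backward taken, forward not taken" heuristic, since backward branches usually close loops. The addresses are hypothetical.

```python
def predict_taken(branch_address, target_address):
    """Assumed heuristic: predict taken only when the target lies before the branch."""
    return target_address < branch_address


# Hypothetical addresses: a loop-closing branch and a forward skip.
print(predict_taken(0x1040, 0x1000))  # True  (backward branch, likely a loop)
print(predict_taken(0x1040, 0x1080))  # False (forward branch)
```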

BRANCH NOT TAKEN
• For the branch-not-taken assumption
When it is right, no
penalty (no wasted
clock cycles)

BRANCH TAKEN
• When branch is taken
Same performance as
stalling when it is wrong

PREDICTION METHODS
• Fixed:
– Prediction is fixed
– Example: branch-never-taken
– Not proper for loop structures
• Static branch prediction:
– base guess on instruction types
• Dynamic branch prediction:
– base guess on execution history

DYNAMIC BRANCH PREDICTION
• Using branch predictors to save the history of
branch executions
Source: http://www.owlnet.rice.edu/
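
The slides do not specify a predictor design. As one common example, the sketch below uses a 2-bit saturating counter per branch address. The class name, the initial state, and the loop pattern are all assumptions made for illustration.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address.
    States 0-1 predict not taken, states 2-3 predict taken."""

    def __init__(self):
        self.counters = {}  # branch address -> state in 0..3

    def predict(self, address):
        return self.counters.get(address, 1) >= 2  # assumed start: weakly not taken

    def update(self, address, taken):
        state = self.counters.get(address, 1)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        self.counters[address] = state


def count_correct(predictor, address, outcomes):
    """Predict each outcome, then train the predictor with the real result."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct


# Hypothetical loop branch: taken 9 times, then not taken, executed twice.
outcomes = ([True] * 9 + [False]) * 2
correct = count_correct(TwoBitPredictor(), 0x400, outcomes)
print(correct, len(outcomes))  # 17 20
```

On this pattern a fixed branch-never-taken prediction would be right only 2 times out of 20, which illustrates why fixed prediction suits loop structures poorly.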

5. SUPERSCALAR OPERATION
OVERVIEW
• The maximum throughput of a pipelined
processor is one instruction per clock cycle.
• If we equip the processor with multiple
processing units to handle several
instructions in parallel in each processing
stage, several instructions start execution in
the same clock cycle – multiple-issue.
• Processors are capable of achieving an
instruction execution throughput of more than
one instruction per cycle – superscalar
processors.
• Multiple-issue requires a wider path to the
cache and multiple execution units.
SCALAR AND SUPERSCALAR
Source: W. M. Johnson, 1989


SUPERSCALAR OPERATION
• Increases the ability of the processor to use
instruction level parallelism
• Multiple instructions are issued every cycle:
– multiple pipelines or functional units operating in
parallel
• Example:
– 3 parallel pipelines
– 3-way superscalar processor
– 3-issue processor

EXECUTION ORDER
• In order execution:
– Instructions are fetched, executed and completed in
compiler generated order
– One stalls, they all stall
– Instructions are statically scheduled
• Out-of-order execution:
– Instructions are fetched in compiler-generated order
– Instruction completion may be in-order or out-of-order
– In between they may be executed in some other order
– Instructions are dynamically scheduled

PIPELINE EVOLUTION
• Basic pipeline:
– Single issue, in-order issue
• First extension:
– Multiple issue (superscalar)
– In-order issue
• Second extension:
– Multiple issue (superscalar)
– Out-of-order issue

EXAMPLE
Source: www.ida.liu.se/

IOI, IOC
• In-order issue, in-order completion
[Figure labels: in-order execution, in-order completion]

OOI, OOC
• Out-of-order issue, out-of-order completion
[Figure labels: out-of-order execution, out-of-order completion]

EXECUTION COMPLETION
• It is desirable to use out-of-order execution,
so that an execution unit is freed to execute
other instructions as soon as possible.
• At the same time, instructions must be
completed in program order to allow precise
exceptions.
• Using temporary registers

SUPER COMPUTER AT FKEE

QUESTION: WHICH ONE IS THE FASTEST?
GPU? Microprocessor? CPU?

6. PERFORMANCE EVALUATION
PIPELINE PERFORMANCE (1/3)
• The execution time, T, of a program that has a
dynamic instruction count N is given by:
T = (N x S) / R
where S is the average number of clock cycles it
takes to fetch and execute one instruction, and R is
the clock rate.
• Instruction throughput is defined as the
number of instructions executed per second:
Throughput = R / S
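
A short numerical sketch of these two quantities, taking execution time as N x S / R and throughput as R / S from the definitions of N, S, and R above. The workload values are hypothetical.

```python
def execution_time_s(n_instructions, cycles_per_instruction, clock_rate_hz):
    """T = (N x S) / R, in seconds."""
    return n_instructions * cycles_per_instruction / clock_rate_hz


def throughput_ips(cycles_per_instruction, clock_rate_hz):
    """Instructions executed per second: R / S."""
    return clock_rate_hz / cycles_per_instruction


# Hypothetical workload: 1 million instructions, S = 1.5 cycles, R = 500 MHz.
N, S, R = 1_000_000, 1.5, 500e6
print(execution_time_s(N, S, R))  # execution time in seconds (3 ms here)
print(throughput_ips(S, R))       # instructions per second
```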

• An n-stage pipeline has the potential to
increase the throughput by n times.
• However, the only real measure of
performance is the total execution time of a
program.
• Higher instruction throughput will not
necessarily lead to higher performance in
terms of program’s execution time.
• Question regarding pipelining:
– What is good value of n?

• Since an n-stage pipeline has the potential to
increase the throughput by n times, how
about we use a 10,000-stage pipeline?
• As the number of stages increases, the
probability of the pipeline being stalled
increases.
• The inherent delay in the basic operations
increases.
• Hardware considerations (area, power,
complexity, etc.)
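
The trade-off behind these points can be illustrated with a toy model. Every constant below is an assumption chosen only for illustration, not a measurement: a fixed per-stage register overhead and a stall rate that grows linearly with the number of stages.

```python
LOGIC_DELAY_PS = 1000.0       # assumed total logic delay of one instruction
LATCH_OVERHEAD_PS = 20.0      # assumed fixed delay added by each pipeline register
STALL_PER_EXTRA_STAGE = 0.05  # assumed extra stall cycles per instruction for each added stage


def time_per_instruction_ps(n_stages):
    """Average time per instruction = average CPI x clock period, under the toy model."""
    cycle = LOGIC_DELAY_PS / n_stages + LATCH_OVERHEAD_PS
    cpi = 1.0 + STALL_PER_EXTRA_STAGE * (n_stages - 1)
    return cpi * cycle


best = min(range(1, 101), key=time_per_instruction_ps)
print(best)                            # 31 under these assumptions
print(time_per_instruction_ps(10000))  # far worse than the best depth
```

With these made-up constants the best depth is a few dozen stages. Past that point the growing stall rate and the fixed overhead outweigh the shorter clock period, so a 10,000-stage pipeline performs far worse.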

FINISH
