
PIPELINING AND VECTOR PROCESSING

Unit – 4
2019
Dr. Bonomali Khuntia
Department of Computer Science
Berhampur University
PIPELINING AND VECTOR PROCESSING

 Parallel Processing

 Pipelining

 Arithmetic Pipeline

 Instruction Pipeline

 RISC Pipeline

 Vector Processing

 Array Processors
Parallel Processing

CONVENTIONAL COMPUTER SYSTEMS


[Diagram: Control Unit → Processor Unit → Memory Unit; the instruction stream flows from memory to the control unit, and the data stream flows between the processor unit and memory]

Characteristics
- Standard von Neumann machine
- Instructions and data are stored in memory
- One operation at a time

Limitations: the von Neumann bottleneck
- The maximum speed of the system is limited by the memory bandwidth (bits/sec or bytes/sec)
- Memory is shared by the CPU and I/O

PARALLEL PROCESSING
 Parallel processing denotes the techniques used to perform simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer system.
 A parallel processing system is able to perform
concurrent data processing to achieve faster
execution time.
 The system may have two or more ALUs and be able
to execute two or more instructions at the same time.

PARALLEL PROCESSING
 A multifunctional organization is usually associated
with a complex control unit to coordinate all the
activities among the various components.
[Diagram: processor registers connected to memory through an incrementer, feeding multiple functional units that operate in parallel: adder-subtractor, integer multiply, logic unit, shift unit, floating-point add-subtract, floating-point multiply, and floating-point divide]

PARALLEL COMPUTERS

Flynn's Classification
 Based on the multiplicity of Instruction Streams and Data Streams

 Instruction Stream
 Sequence of Instructions read from memory

 Data Stream
 Operations performed on the data in the processor

 Parallel processing may occur in the instruction stream, in the data stream, or in both.
                          Number of Data Streams
                          Single      Multiple
Number of      Single     SISD        SIMD
Instruction
Streams        Multiple   MISD        MIMD

SISD COMPUTER SYSTEMS

[Diagram: Control Unit → Processor Unit → Memory Unit; the instruction stream flows from memory to the control unit, and the data stream flows between the processor unit and memory]

Characteristics:
 One control unit, one processor unit, and one memory unit
 Parallel processing may be achieved by means of:
 multiple functional units
 pipeline processing

MISD COMPUTER SYSTEMS

[Diagram: n control units CU1…CUn, each fetching its own instruction stream from a memory module M and driving its own processor P; all processors operate on a single shared data stream]

Characteristics
- There is no computer at present that can be classified as MISD

SIMD COMPUTER SYSTEMS


[Diagram: a single Control Unit fetches instructions over a data bus from memory and broadcasts one instruction stream to processor units P1…Pn; each processor handles its own data stream and is connected through an alignment network to memory modules M1…Mn]

Characteristics
 Only one copy of the program exists.
 A single controller executes one instruction at a time.

MIMD COMPUTER SYSTEMS


[Diagram: processor-memory pairs (P, M) connected through an interconnection network to a shared memory]
Characteristics:
 Multiple processing units (multiprocessor system)
 Execution of multiple instructions on multiple data
 Most multiprocessor and multicomputer systems can be classified in this category.
 The main difference between a multicomputer system and a multiprocessor system is that a multiprocessor system is controlled by one operating system that provides interaction between processors, and all components of the system cooperate in the solution of a problem.

Classification Summary
1. SISD: Instructions are executed sequentially. Parallel processing may
be achieved by means of multiple functional units or by pipeline
processing.
2. SIMD: Includes multiple processing units with a single control unit.
All processors receive the same instruction, but operate on different
data.
3. MISD: There is no computer at present that can be classified as MISD
4. MIMD: A computer system capable of processing several programs at
the same time.

We will consider parallel processing under the following main topics:
o Pipeline processing
o Vector processing
o Array processors
Pipelining

PIPELINING
A technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.

Ai * Bi + Ci for i = 1, 2, 3, ... , 7
[Diagram: Segment 1 loads Ai and Bi from memory into R1 and R2; Segment 2 multiplies R1 * R2 into R3 and loads Ci into R4; Segment 3 adds R3 + R4 into R5]

R1 ← Ai, R2 ← Bi          Load Ai and Bi
R3 ← R1 * R2, R4 ← Ci     Multiply and load Ci
R5 ← R3 + R4              Add

OPERATIONS IN EACH PIPELINE STAGE


Clock Pulse   Segment 1        Segment 2          Segment 3
Number        R1      R2       R3        R4       R5
1             A1      B1
2             A2      B2       A1 * B1   C1
3             A3      B3       A2 * B2   C2       A1 * B1 + C1
4             A4      B4       A3 * B3   C3       A2 * B2 + C2
5             A5      B5       A4 * B4   C4       A3 * B3 + C3
6             A6      B6       A5 * B5   C5       A4 * B4 + C4
7             A7      B7       A6 * B6   C6       A5 * B5 + C5
8                              A7 * B7   C7       A6 * B6 + C6
9                                                 A7 * B7 + C7
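The table above can be reproduced with a short simulation. This Python sketch (not part of the original slides) models the three segments, updating the registers once per clock pulse and collecting every value that reaches R5:

```python
def simulate_pipeline(a, b, c):
    """Clock-by-clock simulation; returns the sequence of values in R5."""
    n = len(a)
    r1 = r2 = r3 = r4 = None
    results = []
    for clock in range(n + 2):                 # k + n - 1 = n + 2 pulses for k = 3
        # Segment 3 is evaluated first so it sees the registers
        # as they were left at the end of the previous pulse.
        r5 = r3 + r4 if r3 is not None else None
        # Segment 2: R3 <- R1 * R2, R4 <- Ci
        r3 = r1 * r2 if r1 is not None else None
        r4 = c[clock - 1] if 1 <= clock <= n else None
        # Segment 1: R1 <- Ai, R2 <- Bi
        r1, r2 = (a[clock], b[clock]) if clock < n else (None, None)
        if r5 is not None:
            results.append(r5)
    return results

a = [1, 2, 3, 4, 5, 6, 7]
b = [2, 2, 2, 2, 2, 2, 2]
c = [10, 10, 10, 10, 10, 10, 10]
print(simulate_pipeline(a, b, c))   # Ai * Bi + Ci for each i
```

The first result appears at clock pulse 3 and one new result per pulse after that, matching the table.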

GENERAL PIPELINE
General Structure of a 4-Segment Pipeline
[Diagram: Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4, with a common clock driving all interface registers]

Space-Time Diagram: Segment utilization as a function of time.


The following diagram shows 6 tasks T1 through T6 executed in 4 segments.

               Clock cycles
            1    2    3    4    5    6    7    8    9
Segment 1   T1   T2   T3   T4   T5   T6
        2        T1   T2   T3   T4   T5   T6
        3             T1   T2   T3   T4   T5   T6
        4                  T1   T2   T3   T4   T5   T6

No matter how many segments there are, once the pipeline is full it takes only one clock period to obtain an output.

PIPELINE SPEEDUP
n: Number of tasks to be performed

Conventional Machine (Non-Pipelined)
tn: Time required to complete each task
t: Time required to complete the n tasks
t = n * tn

Pipelined Machine (k stages)
tp: Clock cycle (time to complete each suboperation)
tk: Time required to complete the n tasks
tk = (k + n - 1) * tp

Speedup
Sk: Speedup
Sk = n * tn / ((k + n - 1) * tp)
lim (n → ∞) Sk = tn / tp   ( = k, if tn = k * tp )

PIPELINE AND MULTIPLE FUNCTION UNITS


Example
- 4-stage pipeline
- suboperation in each stage: tp = 20 ns
- 100 tasks to be executed
- 1 task in the non-pipelined system: tn = 4 * 20 = 80 ns

Pipelined System
(k + n - 1) * tp = (4 + 99) * 20 = 2060 ns
Non-Pipelined System
n * k * tp = 100 * 80 = 8000 ns
Speedup
Sk = 8000 / 2060 = 3.88
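These numbers can be checked with a small helper (an illustrative sketch, not from the slides), which also shows the limit Sk → k as the number of tasks grows:

```python
def pipeline_speedup(k, n, tp, tn=None):
    """Speedup Sk = (n * tn) / ((k + n - 1) * tp); tn defaults to k * tp."""
    if tn is None:
        tn = k * tp                     # non-pipelined time per task
    t_nonpipe = n * tn                  # n * k * tp
    t_pipe = (k + n - 1) * tp
    return t_nonpipe / t_pipe

print(pipeline_speedup(4, 100, 20))     # ~3.88, as on the slide
print(pipeline_speedup(4, 10**6, 20))   # approaches k = 4 as n grows
```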
A 4-stage pipeline is roughly comparable to a system with 4 identical functional units operating in parallel.

Multiple Functional Units (SIMD)

[Diagram: instructions Ii, Ii+1, Ii+2, Ii+3 dispatched to four parallel processors P1, P2, P3, P4]

Challenges in Pipeline

 Different segments may take different times to complete their suboperations.
 The clock cycle must be chosen to equal the time delay of the segment with the maximum propagation time.
 It is not always correct to assume that a non-pipelined circuit has the same time delay as an equivalent pipelined circuit.
Arithmetic Pipeline

ARITHMETIC PIPELINE
Floating-point adder: X = A x 10^a, Y = B x 10^b

[1] Compare the exponents
[2] Align the mantissas
[3] Add or subtract the mantissas
[4] Normalize the result

[Diagram: a 4-segment pipeline with registers R between segments; Segment 1 compares the exponents by subtraction, Segment 2 chooses the larger exponent and aligns the smaller mantissa, Segment 3 adds or subtracts the mantissas, Segment 4 adjusts the exponent and normalizes the result]

Example:
X = A x 10^a = 0.9504 x 10^3
Y = B x 10^b = 0.8200 x 10^2

1) Compare exponents: 3 - 2 = 1
2) Align mantissas:
   X = 0.9504 x 10^3
   Y = 0.08200 x 10^3
3) Add mantissas:
   Z = 1.0324 x 10^3
4) Normalize the result:
   Z = 0.10324 x 10^4
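The four segments can be sketched in Python (an illustration, not the slides' hardware, using decimal (mantissa, exponent) pairs; a real unit works on binary significands, and this sketch only normalizes an overflowing mantissa):

```python
def fp_add(x, y):
    """Add two numbers given as (mantissa, exponent) with |mantissa| < 1."""
    (ma, ea), (mb, eb) = x, y
    # Segment 1: compare the exponents by subtraction
    diff = ea - eb
    # Segment 2: choose the larger exponent, align the smaller mantissa
    if diff >= 0:
        e, mb = ea, mb / 10**diff
    else:
        e, ma = eb, ma / 10**(-diff)
    # Segment 3: add the mantissas
    m = ma + mb
    # Segment 4: normalize (shift right and adjust the exponent on overflow)
    while abs(m) >= 1:
        m, e = m / 10, e + 1
    return m, e

print(fp_add((0.9504, 3), (0.8200, 2)))   # approximately (0.10324, 4)
```

Running it on the slide's example reproduces Z = 0.10324 x 10^4 (up to binary floating-point rounding).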
Instruction Pipeline

INSTRUCTION PIPELINE
Six Phases in an Instruction Cycle
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place

The pipeline may not perform at its maximum rate due to:
Different segments taking different times to operate
Some segment being skipped for certain operations
Memory access conflicts

==> 4-Stage Pipeline

[1] FI: Fetch an instruction from memory


[2] DA: Decode the instruction and calculate the effective address of the operand
[3] FO: Fetch the operand
[4] EX: Execute the operation
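The overlapped timing of this 4-stage pipeline can be generated with a small sketch (an assumed helper, not from the slides): each instruction enters one clock after the previous one and occupies FI, DA, FO, EX in turn.

```python
STAGES = ["FI", "DA", "FO", "EX"]

def schedule(n_instructions):
    """Return {instruction: {clock: stage}} for an ideal 4-stage pipeline."""
    table = {}
    for i in range(n_instructions):
        # instruction i+1 starts its fetch at clock i+1
        table[i + 1] = {i + 1 + s: STAGES[s] for s in range(len(STAGES))}
    return table

# Print the space-time diagram for three instructions (6 clocks total)
for instr, slots in schedule(3).items():
    row = "".join(slots.get(t, "  ").ljust(4) for t in range(1, 7))
    print(f"i+{instr - 1}: {row}")
```

With 3 instructions the last EX falls in clock 4 + 3 - 1 = 6, matching the (k + n - 1) formula.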

INSTRUCTION PIPELINE

Execution of Three Instructions in a 4-Stage Pipeline

Conventional (each instruction completes before the next begins):

i      FI  DA  FO  EX
i+1                    FI  DA  FO  EX
i+2                                    FI  DA  FO  EX

Pipelined (instructions overlap):

i      FI  DA  FO  EX
i+1        FI  DA  FO  EX
i+2            FI  DA  FO  EX

INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE


[Flowchart: Segment 1 fetches the instruction from memory; Segment 2 decodes it and calculates the effective address, then tests for a branch; if there is no branch, Segment 3 fetches the operand from memory and Segment 4 executes the instruction, after which interrupts are tested, interrupt handling is invoked if needed, and the PC is updated; a branch or interrupt empties the pipe]

There are some difficulties that will prevent the instruction pipeline from operating at its maximum rate.

Step:           1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction 1   FI  DA  FO  EX
            2       FI  DA  FO  EX
(Branch)    3           FI  DA  FO  EX
            4               FI  -   -   FI  DA  FO  EX
            5                           -   FI  DA  FO  EX
            6                                   FI  DA  FO  EX
            7                                       FI  DA  FO  EX

(Steps 5 and 6 are the empty pipe while the branch in instruction 3 resolves.)

Reasons for the pipeline to deviate from its normal operation:

 Resource conflicts, caused by access to memory by two segments at the same time.
  Can be resolved by using separate instruction and data memories.

 Data dependency conflicts arise when an instruction depends on the result of a previous instruction, but this result is not yet available.
  Example: an instruction with register indirect mode cannot proceed to fetch the operand if the previous instruction is loading the address into the register.

 Branch difficulties arise from program control instructions that may change the value of the PC.

Methods to handle data dependency:

 Hardware interlocks are circuits that detect instructions whose source operands are destinations of prior instructions. Detection causes the hardware to insert the required delays without altering the program sequence.
 Operand forwarding uses special hardware to detect a conflict
and then avoid it by routing the data through special paths
between pipeline segments. This requires additional hardware
paths through multiplexers as well as the circuit to detect the
conflict.
 Delayed load is a procedure that gives the responsibility for
solving data conflicts to the compiler. The compiler is designed
to detect a data conflict and reorder the instructions as
necessary to delay the loading of the conflicting data by
inserting no-operation instructions.
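The delayed-load idea can be sketched as a toy compiler pass (the three-address tuple format and register names here are assumptions for illustration, not a real instruction set): it scans the instruction list and inserts a NOP whenever an instruction reads the register that the immediately preceding LOAD writes.

```python
def insert_load_delays(program):
    """Insert a NOP after a LOAD whose destination the next instruction reads."""
    out = []
    for instr in program:
        op, dest, srcs = instr
        if out:
            prev_op, prev_dest, _ = out[-1]
            if prev_op == "LOAD" and prev_dest in srcs:
                out.append(("NOP", None, ()))   # delay the conflicting use
        out.append(instr)
    return out

prog = [("LOAD", "R1", ("A",)),
        ("LOAD", "R2", ("B",)),
        ("ADD",  "R3", ("R1", "R2")),   # reads R2, loaded just above
        ("STORE", "C", ("R3",))]
for i in insert_load_delays(prog):
    print(i)
```

A smarter compiler would first try to reorder independent instructions into the delay slot and fall back to a NOP only when none exists.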

Methods to handle branch instructions:


 Prefetching the target instruction in addition to the next instruction
allows either instruction to be available.
 A branch target buffer (BTB) is an associative memory included in the fetch segment of the pipeline. It stores the target instruction for a previously executed branch, along with the next few instructions after the branch target instruction. This way, branch instructions that have occurred previously are readily available in the pipeline without interruption.
 The loop buffer is a variation of the BTB. It is a small, very-high-speed register file maintained by the instruction fetch segment of the pipeline. It stores all branches within a loop segment.
 Branch prediction uses some additional logic to guess the outcome of a
conditional branch instruction before it is executed. The pipeline then
begins prefetching instructions from the predicted path.
 Delayed branch is used in most RISC processors so that the compiler
rearranges the instructions to delay the branch.
Vector Processing

VECTOR PROCESSING
 Vector Processor (computer): the ability to process vectors and related data structures, such as matrices and multi-dimensional arrays, much faster than conventional computers.
 Vector processors may also be pipelined.

Vector Processing Applications

Problems that can be efficiently formulated in terms of vectors:
1. Long-range weather forecasting
2. Petroleum explorations
3. Seismic data analysis
4. Medical diagnosis
5. Aerodynamics and space flight simulations
6. Artificial intelligence and expert systems
7. Mapping the human genome
8. Image processing

VECTOR OPERATIONS

A vector is an ordered, one-dimensional array of data items.

DO 20 I = 1, 100
20 C(I) = B(I) + A(I)
Conventional computer

Initialize I = 0
20 Read A(I)
   Read B(I)
   Store C(I) = A(I) + B(I)
   Increment I = I + 1
   If I ≤ 100 go to 20
Vector computer
C(1:100) = A(1:100) + B(1:100)

Vector Instruction Format:

Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length
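The contrast between the two styles can be sketched in Python (plain lists stand in for the FORTRAN arrays; the values chosen here are arbitrary):

```python
a = list(range(100))          # A(1:100)
b = list(range(100, 200))     # B(1:100)

# Conventional computer: one element per loop iteration
c_scalar = [0] * 100
for i in range(100):
    c_scalar[i] = a[i] + b[i]

# Vector computer: a single operation over the whole operand range,
# standing in for C(1:100) = A(1:100) + B(1:100)
c_vector = [x + y for x, y in zip(a, b)]

print(c_scalar == c_vector)
```

The scalar version executes the read-add-store-increment-test sequence 100 times; the vector version expresses the same work as one instruction whose operand fields name the base addresses and the vector length.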

Matrix Multiplication

For the product of two 3 x 3 matrices, the total number of multiplications (or additions) required is 9 x 3 = 27.
The inner product consists of the sum of k product terms of the form:
C = A1B1 + A2B2 + A3B3 + ... + AkBk

[Diagram: Source A and Source B feed a multiplier pipeline; its products feed an adder pipeline whose output is fed back to accumulate partial sums]

Pipeline for Inner Product

Matrix Multiplication
 All segment registers in the multiplier and adder are initialized to zero.
 After the first four cycles, the products begin to be added to the output of the adder.
 During the next four cycles, 0 is added to each of the four products entering the adder pipeline.
 At the end of the eighth cycle, the first four products are in the four adder segments.
 At the beginning of the ninth cycle, the output of the adder is A1B1 and the output of the multiplier is A5B5. The tenth cycle starts the addition A2B2 + A6B6, and so on.


Matrix Multiplication
 When there are no more product terms to be added, the system inserts four 0s into the multiplier pipeline.
 The adder pipeline will then have one partial product in each of its four segments. The four partial sums are then added to form the final sum.
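The interleaved accumulation described above can be sketched as follows (a simplified software model of the result, not the cycle-by-cycle hardware timing): with a 4-segment adder, each segment ends up holding the sum of every fourth product, and the four partial sums are combined at the end.

```python
def inner_product(a, b, k=4):
    """Inner product via k interleaved partial sums, as in a k-segment adder."""
    n = len(a)
    pad = (-n) % k                          # zero-pad to a multiple of k,
    prods = [x * y for x, y in zip(a, b)]   # mirroring the inserted 0s
    prods += [0] * pad
    # adder segment i accumulates products i, i+k, i+2k, ...
    partials = [sum(prods[i::k]) for i in range(k)]
    return sum(partials)                    # final combination of partial sums

a = [1, 2, 3, 4, 5, 6, 7]
b = [2, 2, 2, 2, 2, 2, 2]
print(inner_product(a, b))                  # same as sum(ai * bi)
```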


MEMORY INTERLEAVING
 Pipeline and vector processors often require
simultaneous access to memory from two or more
sources.
 An instruction pipeline may require the fetching of
an instruction and an operand at the same time from
two different segments.
 An arithmetic pipeline usually requires two or more
operands to enter the pipeline at the same time.
 Instead of using two memory buses for
simultaneous access, the memory can be partitioned
into a number of modules connected to common
memory address and data buses.

MEMORY INTERLEAVING (Contd…)


 A memory module is a memory array together with its own address and data registers.
 The advantage of a modular memory is that it allows the use of a technique called interleaving.
 In an interleaved memory, different sets of addresses are assigned to different memory modules.

[Diagram: four memory modules M0–M3 on a common address bus and data bus; each module has its own address register (AR), memory array, and data register (DR)]

 Different sets of addresses are assigned to different memory modules.
 For example, in a two-module memory system, the even addresses may be in one module and the odd addresses in the other.
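The address mapping can be sketched as follows (an illustration assuming the low-order address bits select the module, so consecutive addresses fall in different modules and can be accessed in parallel):

```python
def module_of(addr, m=4):
    """Module number: low-order part of the address."""
    return addr % m

def word_of(addr, m=4):
    """Word index within the selected module."""
    return addr // m

# Consecutive addresses cycle through the four modules
for addr in range(8):
    print(addr, "-> module", module_of(addr), "word", word_of(addr))

# Two-module case from the slide: even addresses in one module,
# odd addresses in the other
print(module_of(6, m=2), module_of(7, m=2))
```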
Array Processor

Array processor
 An array processor is a processor that is employed to
compute on large arrays of data. The term is used to
refer to two different types of processors.
 An attached array processor
 SIMD Array Processor
 Both are used to manipulate vectors, but they differ in
internal organizations.

An attached array processor


 It is an auxiliary processor attached to a general-purpose computer.
 Its purpose is to enhance the performance of the host computer by providing vector processing for complex scientific applications.
 Parallel processing is achieved by multiple functional units.

[Diagram: the general-purpose computer (with main memory on the memory bus) connects through an input-output interface to the attached array processor (with its own local memory); a high-speed memory-to-memory bus links main memory and local memory]

SIMD Array Processor


 It is a Single Instruction Multiple Data organization.
 It manipulates vector instructions by means of multiple functional units responding to a common instruction under the control of a common control unit.

[Diagram: a master control unit, connected to main memory, broadcasts instructions to processing elements PE1…PEn, each with its own local memory M1…Mn]

SIMD Array Processor (Contd…)

 Each processing element (PE) includes an ALU, a floating-point arithmetic unit, and working registers.
 The main memory is used for storage of the program.
 The function of the master control unit is to decode the instructions and determine how each instruction is to be executed. It controls the operations in the PEs.
 Scalar and program control instructions are directly executed in the master control unit.
 Vector instructions are broadcast to all PEs simultaneously.
THANK YOU
