
PIPELINING AND VECTOR PROCESSING

Unit – 4
2019
Dr. Bonomali Khuntia
Department of Computer Science
Berhampur University
PIPELINING AND VECTOR PROCESSING

 Parallel Processing

 Pipelining

 Arithmetic Pipeline

 Instruction Pipeline

 RISC Pipeline

 Vector Processing

 Array Processors
Parallel Processing

CONVENTIONAL COMPUTER SYSTEMS


[Diagram: Control Unit → Processor Unit → Memory Unit; the instruction stream flows from memory to the control unit, and the data stream flows between the processor unit and memory]

Characteristics
- Standard von Neumann machine
- Instructions and data are stored in memory
- One operation at a time

Limitations: the von Neumann bottleneck
- The maximum speed of the system is limited by the memory bandwidth (bits/sec or bytes/sec)
- Memory is shared by the CPU and I/O

PARALLEL PROCESSING
 Parallel processing denotes the techniques used to perform simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer system.
 A parallel processing system is able to perform
concurrent data processing to achieve faster
execution time.
 The system may have two or more ALUs and be able
to execute two or more instructions at the same time.

PARALLEL PROCESSING
 A multifunctional organization is usually associated
with a complex control unit to coordinate all the
activities among the various components.
[Diagram: processor registers connected to memory through an incrementer, feeding multiple functional units that operate in parallel: adder-subtractor, integer multiply, logic unit, shift unit, floating-point add-subtract, floating-point multiply, and floating-point divide]

PARALLEL COMPUTERS

Flynn's Classification
 Based on the multiplicity of Instruction Streams and Data Streams

 Instruction Stream
 Sequence of Instructions read from memory

 Data Stream
 Operations performed on the data in the processor

 Parallel processing may occur in the instruction stream, in the data stream, or in both.
                          Number of Data Streams
                          Single      Multiple
Number of      Single     SISD        SIMD
Instruction
Streams        Multiple   MISD        MIMD

SISD COMPUTER SYSTEMS

[Diagram: Control Unit → Processor Unit → Memory Unit; the instruction stream flows from memory to the control unit, and the data stream flows between the processor unit and memory]

Characteristics:
 One control unit, one processor unit, and one memory unit
 Parallel processing may be achieved by means of:
 multiple functional units
 pipeline processing

MISD COMPUTER SYSTEMS

[Diagram: n control units CU1…CUn, each fetching its own instruction stream from a memory module M and driving its own processor P; all processors operate on a single shared data stream]

Characteristics
- There is no computer at present that can be classified as MISD

SIMD COMPUTER SYSTEMS


[Diagram: a single Control Unit fetches instructions over a data bus from memory and broadcasts one instruction stream to processor units P1…Pn; each processor handles its own data stream and is connected through an alignment network to memory modules M1…Mn]

Characteristics
 Only one copy of the program exists.
 A single controller executes one instruction at a time.

MIMD COMPUTER SYSTEMS


[Diagram: processor-memory pairs (P, M) connected through an interconnection network to a shared memory]
Characteristics:
 Multiple processing units (multiprocessor system)
 Execution of multiple instructions on multiple data
 Most multiprocessor and multicomputer systems can be classified in this category.
 The main difference between a multicomputer system and a multiprocessor system is that a multiprocessor system is controlled by one operating system that provides interaction between processors, and all components of the system cooperate in the solution of a problem.

Classification Summary
1. SISD: Instructions are executed sequentially. Parallel processing may
be achieved by means of multiple functional units or by pipeline
processing.
2. SIMD: Includes multiple processing units with a single control unit.
All processors receive the same instruction, but operate on different
data.
3. MISD: There is no computer at present that can be classified as MISD
4. MIMD: A computer system capable of processing several programs at
the same time.

We will consider parallel processing under the following main topics:
o Pipeline processing
o Vector processing
o Array processors
Pipelining

PIPELINING
A technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.

Ai * Bi + Ci for i = 1, 2, 3, ... , 7
[Diagram: Segment 1 loads Ai and Bi from memory into R1 and R2; Segment 2 multiplies R1 * R2 into R3 and loads Ci into R4; Segment 3 adds R3 + R4 into R5]

R1 ← Ai, R2 ← Bi          Load Ai and Bi
R3 ← R1 * R2, R4 ← Ci     Multiply and load Ci
R5 ← R3 + R4              Add

OPERATIONS IN EACH PIPELINE STAGE


Clock Pulse   Segment 1        Segment 2          Segment 3
Number        R1      R2       R3        R4       R5
1             A1      B1
2             A2      B2       A1 * B1   C1
3             A3      B3       A2 * B2   C2       A1 * B1 + C1
4             A4      B4       A3 * B3   C3       A2 * B2 + C2
5             A5      B5       A4 * B4   C4       A3 * B3 + C3
6             A6      B6       A5 * B5   C5       A4 * B4 + C4
7             A7      B7       A6 * B6   C6       A5 * B5 + C5
8                              A7 * B7   C7       A6 * B6 + C6
9                                                 A7 * B7 + C7
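The table above can be reproduced with a short simulation. This Python sketch (not part of the original slides) models the three segments, updating the registers once per clock pulse and collecting every value that reaches R5:

```python
def simulate_pipeline(a, b, c):
    """Clock-by-clock simulation; returns the sequence of values in R5."""
    n = len(a)
    r1 = r2 = r3 = r4 = None
    results = []
    for clock in range(n + 2):                 # k + n - 1 = n + 2 pulses for k = 3
        # Segment 3 is evaluated first so it sees the registers
        # as they were left at the end of the previous pulse.
        r5 = r3 + r4 if r3 is not None else None
        # Segment 2: R3 <- R1 * R2, R4 <- Ci
        r3 = r1 * r2 if r1 is not None else None
        r4 = c[clock - 1] if 1 <= clock <= n else None
        # Segment 1: R1 <- Ai, R2 <- Bi
        r1, r2 = (a[clock], b[clock]) if clock < n else (None, None)
        if r5 is not None:
            results.append(r5)
    return results

a = [1, 2, 3, 4, 5, 6, 7]
b = [2, 2, 2, 2, 2, 2, 2]
c = [10, 10, 10, 10, 10, 10, 10]
print(simulate_pipeline(a, b, c))   # Ai * Bi + Ci for each i
```

The first result appears at clock pulse 3 and one new result per pulse after that, matching the table.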

GENERAL PIPELINE
General Structure of a 4-Segment Pipeline
[Diagram: Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4, with a common clock driving all interface registers]

Space-Time Diagram: Segment utilization as a function of time.


The following diagram shows 6 tasks T1 through T6 executed in 4 segments.

               Clock cycles
            1    2    3    4    5    6    7    8    9
Segment 1   T1   T2   T3   T4   T5   T6
        2        T1   T2   T3   T4   T5   T6
        3             T1   T2   T3   T4   T5   T6
        4                  T1   T2   T3   T4   T5   T6

No matter how many segments there are, once the pipeline is full it takes only one clock period to obtain an output.

PIPELINE SPEEDUP
n: Number of tasks to be performed

Conventional Machine (Non-Pipelined)
tn: Time required to complete each task
t: Time required to complete the n tasks
t = n * tn

Pipelined Machine (k stages)
tp: Clock cycle (time to complete each suboperation)
tk: Time required to complete the n tasks
tk = (k + n - 1) * tp

Speedup
Sk: Speedup
Sk = n * tn / ((k + n - 1) * tp)
lim (n → ∞) Sk = tn / tp   ( = k, if tn = k * tp )

PIPELINE AND MULTIPLE FUNCTION UNITS


Example
- 4-stage pipeline
- suboperation in each stage: tp = 20 ns
- 100 tasks to be executed
- 1 task in the non-pipelined system: tn = 4 * 20 = 80 ns

Pipelined System
(k + n - 1) * tp = (4 + 99) * 20 = 2060 ns
Non-Pipelined System
n * k * tp = 100 * 80 = 8000 ns
Speedup
Sk = 8000 / 2060 = 3.88
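These numbers can be checked with a small helper (an illustrative sketch, not from the slides), which also shows the limit Sk → k as the number of tasks grows:

```python
def pipeline_speedup(k, n, tp, tn=None):
    """Speedup Sk = (n * tn) / ((k + n - 1) * tp); tn defaults to k * tp."""
    if tn is None:
        tn = k * tp                     # non-pipelined time per task
    t_nonpipe = n * tn                  # n * k * tp
    t_pipe = (k + n - 1) * tp
    return t_nonpipe / t_pipe

print(pipeline_speedup(4, 100, 20))     # ~3.88, as on the slide
print(pipeline_speedup(4, 10**6, 20))   # approaches k = 4 as n grows
```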
A 4-stage pipeline is roughly comparable to a system with 4 identical functional units operating in parallel.

Multiple Functional Units (SIMD)

[Diagram: instructions Ii, Ii+1, Ii+2, Ii+3 dispatched to four parallel processors P1, P2, P3, P4]

Challenges in Pipeline

 Different segments may take different times to complete their suboperations.
 The clock cycle must be chosen to equal the time delay of the segment with the maximum propagation time.
 It is not always correct to assume that a non-pipelined circuit has the same time delay as an equivalent pipelined circuit.
Arithmetic Pipeline

ARITHMETIC PIPELINE
Floating-point adder: X = A x 10^a, Y = B x 10^b

[1] Compare the exponents
[2] Align the mantissas
[3] Add or subtract the mantissas
[4] Normalize the result

[Diagram: a 4-segment pipeline with registers R between segments; Segment 1 compares the exponents by subtraction, Segment 2 chooses the larger exponent and aligns the smaller mantissa, Segment 3 adds or subtracts the mantissas, Segment 4 adjusts the exponent and normalizes the result]

Example:
X = A x 10^a = 0.9504 x 10^3
Y = B x 10^b = 0.8200 x 10^2

1) Compare exponents: 3 - 2 = 1
2) Align mantissas:
   X = 0.9504 x 10^3
   Y = 0.08200 x 10^3
3) Add mantissas:
   Z = 1.0324 x 10^3
4) Normalize the result:
   Z = 0.10324 x 10^4
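The four segments can be sketched in Python (an illustration, not the slides' hardware, using decimal (mantissa, exponent) pairs; a real unit works on binary significands, and this sketch only normalizes an overflowing mantissa):

```python
def fp_add(x, y):
    """Add two numbers given as (mantissa, exponent) with |mantissa| < 1."""
    (ma, ea), (mb, eb) = x, y
    # Segment 1: compare the exponents by subtraction
    diff = ea - eb
    # Segment 2: choose the larger exponent, align the smaller mantissa
    if diff >= 0:
        e, mb = ea, mb / 10**diff
    else:
        e, ma = eb, ma / 10**(-diff)
    # Segment 3: add the mantissas
    m = ma + mb
    # Segment 4: normalize (shift right and adjust the exponent on overflow)
    while abs(m) >= 1:
        m, e = m / 10, e + 1
    return m, e

print(fp_add((0.9504, 3), (0.8200, 2)))   # approximately (0.10324, 4)
```

Running it on the slide's example reproduces Z = 0.10324 x 10^4 (up to binary floating-point rounding).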
Instruction Pipeline

INSTRUCTION PIPELINE
Six Phases in an Instruction Cycle
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place

The pipeline may not perform at its maximum rate due to:
Different segments taking different times to operate
Some segment being skipped for certain operations
Memory access conflicts

==> 4-Stage Pipeline

[1] FI: Fetch an instruction from memory


[2] DA: Decode the instruction and calculate the effective address of the operand
[3] FO: Fetch the operand
[4] EX: Execute the operation
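The overlapped timing of this 4-stage pipeline can be generated with a small sketch (an assumed helper, not from the slides): each instruction enters one clock after the previous one and occupies FI, DA, FO, EX in turn.

```python
STAGES = ["FI", "DA", "FO", "EX"]

def schedule(n_instructions):
    """Return {instruction: {clock: stage}} for an ideal 4-stage pipeline."""
    table = {}
    for i in range(n_instructions):
        # instruction i+1 starts its fetch at clock i+1
        table[i + 1] = {i + 1 + s: STAGES[s] for s in range(len(STAGES))}
    return table

# Print the space-time diagram for three instructions (6 clocks total)
for instr, slots in schedule(3).items():
    row = "".join(slots.get(t, "  ").ljust(4) for t in range(1, 7))
    print(f"i+{instr - 1}: {row}")
```

With 3 instructions the last EX falls in clock 4 + 3 - 1 = 6, matching the (k + n - 1) formula.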

INSTRUCTION PIPELINE

Execution of Three Instructions in a 4-Stage Pipeline

Conventional (each instruction completes before the next begins):

i      FI  DA  FO  EX
i+1                    FI  DA  FO  EX
i+2                                    FI  DA  FO  EX

Pipelined (instructions overlap):

i      FI  DA  FO  EX
i+1        FI  DA  FO  EX
i+2            FI  DA  FO  EX

INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE


[Flowchart: Segment 1 fetches the instruction from memory; Segment 2 decodes it and calculates the effective address, then tests for a branch; if there is no branch, Segment 3 fetches the operand from memory and Segment 4 executes the instruction, after which interrupts are tested, interrupt handling is invoked if needed, and the PC is updated; a branch or interrupt empties the pipe]

There are some difficulties that will prevent the instruction pipeline from operating at its maximum rate.

Step:           1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction 1   FI  DA  FO  EX
            2       FI  DA  FO  EX
(Branch)    3           FI  DA  FO  EX
            4               FI  -   -   FI  DA  FO  EX
            5                           -   FI  DA  FO  EX
            6                                   FI  DA  FO  EX
            7                                       FI  DA  FO  EX

(Steps 5 and 6 are the empty pipe while the branch in instruction 3 resolves.)

Reasons for the pipeline to deviate from its normal operation:

 Resource conflicts, caused by access to memory by two segments at the same time.
  Can be resolved by using separate instruction and data memories.

 Data dependency conflicts arise when an instruction depends on the result of a previous instruction, but this result is not yet available.
  Example: an instruction with register indirect mode cannot proceed to fetch the operand if the previous instruction is loading the address into the register.

 Branch difficulties arise from program control instructions that may change the value of the PC.

Methods to handle data dependency:

 Hardware interlocks are circuits that detect instructions whose source operands are destinations of prior instructions. Detection causes the hardware to insert the required delays without altering the program sequence.
 Operand forwarding uses special hardware to detect a conflict
and then avoid it by routing the data through special paths
between pipeline segments. This requires additional hardware
paths through multiplexers as well as the circuit to detect the
conflict.
 Delayed load is a procedure that gives the responsibility for
solving data conflicts to the compiler. The compiler is designed
to detect a data conflict and reorder the instructions as
necessary to delay the loading of the conflicting data by
inserting no-operation instructions.
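The delayed-load idea can be sketched as a toy compiler pass (the three-address tuple format and register names here are assumptions for illustration, not a real instruction set): it scans the instruction list and inserts a NOP whenever an instruction reads the register that the immediately preceding LOAD writes.

```python
def insert_load_delays(program):
    """Insert a NOP after a LOAD whose destination the next instruction reads."""
    out = []
    for instr in program:
        op, dest, srcs = instr
        if out:
            prev_op, prev_dest, _ = out[-1]
            if prev_op == "LOAD" and prev_dest in srcs:
                out.append(("NOP", None, ()))   # delay the conflicting use
        out.append(instr)
    return out

prog = [("LOAD", "R1", ("A",)),
        ("LOAD", "R2", ("B",)),
        ("ADD",  "R3", ("R1", "R2")),   # reads R2, loaded just above
        ("STORE", "C", ("R3",))]
for i in insert_load_delays(prog):
    print(i)
```

A smarter compiler would first try to reorder independent instructions into the delay slot and fall back to a NOP only when none exists.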

Methods to handle branch instructions:


 Prefetching the target instruction in addition to the next instruction
allows either instruction to be available.
 A branch target buffer (BTB) is an associative memory included in the fetch segment of the pipeline. It stores the target instruction for a previously executed branch, along with the next few instructions after the branch target instruction. This way, branch instructions that have occurred previously are readily available in the pipeline without interruption.
 The loop buffer is a variation of the BTB. It is a small, very-high-speed register file maintained by the instruction fetch segment of the pipeline. It stores all branches within a loop segment.
 Branch prediction uses some additional logic to guess the outcome of a
conditional branch instruction before it is executed. The pipeline then
begins prefetching instructions from the predicted path.
 Delayed branch is used in most RISC processors so that the compiler
rearranges the instructions to delay the branch.
Vector Processing

VECTOR PROCESSING
 Vector Processor (computer): the ability to process vectors and related data structures, such as matrices and multi-dimensional arrays, much faster than conventional computers.
 Vector processors may also be pipelined.

Vector Processing Applications

Problems that can be efficiently formulated in terms of vectors:
1. Long-range weather forecasting
2. Petroleum explorations
3. Seismic data analysis
4. Medical diagnosis
5. Aerodynamics and space flight simulations
6. Artificial intelligence and expert systems
7. Mapping the human genome
8. Image processing

VECTOR OPERATIONS

A vector is an ordered, one-dimensional array of data items.

DO 20 I = 1, 100
20 C(I) = B(I) + A(I)
Conventional computer

Initialize I = 0
20 Read A(I)
   Read B(I)
   Store C(I) = A(I) + B(I)
   Increment I = I + 1
   If I ≤ 100 go to 20
Vector computer
C(1:100) = A(1:100) + B(1:100)

Vector Instruction Format:

Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length
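The contrast between the two styles can be sketched in Python (plain lists stand in for the FORTRAN arrays; the values chosen here are arbitrary):

```python
a = list(range(100))          # A(1:100)
b = list(range(100, 200))     # B(1:100)

# Conventional computer: one element per loop iteration
c_scalar = [0] * 100
for i in range(100):
    c_scalar[i] = a[i] + b[i]

# Vector computer: a single operation over the whole operand range,
# standing in for C(1:100) = A(1:100) + B(1:100)
c_vector = [x + y for x, y in zip(a, b)]

print(c_scalar == c_vector)
```

The scalar version executes the read-add-store-increment-test sequence 100 times; the vector version expresses the same work as one instruction whose operand fields name the base addresses and the vector length.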

Matrix Multiplication

For the product of two 3 x 3 matrices, the total number of multiplications (or additions) required is 9 x 3 = 27.
The inner product consists of the sum of k product terms of the form:
C = A1B1 + A2B2 + A3B3 + ... + AkBk

[Diagram: Source A and Source B feed a multiplier pipeline; its products feed an adder pipeline whose output is fed back to accumulate partial sums]

Pipeline for Inner Product

Matrix Multiplication
 All segment registers in the multiplier and adder are initialized to zero.
 After the first four cycles, the products begin to be added to the output of the adder.
 During the next four cycles, 0 is added to each of the four products entering the adder pipeline.
 At the end of the eighth cycle, the first four products are in the four adder segments.
 At the beginning of the ninth cycle, the output of the adder is A1B1 and the output of the multiplier is A5B5. The tenth cycle starts the addition A2B2 + A6B6, and so on.


Matrix Multiplication
 When there are no more product terms to be added, the system inserts four 0s into the multiplier pipeline.
 The adder pipeline will then have one partial product in each of its four segments. The four partial sums are then added to form the final sum.
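The interleaved accumulation described above can be sketched as follows (a simplified software model of the result, not the cycle-by-cycle hardware timing): with a 4-segment adder, each segment ends up holding the sum of every fourth product, and the four partial sums are combined at the end.

```python
def inner_product(a, b, k=4):
    """Inner product via k interleaved partial sums, as in a k-segment adder."""
    n = len(a)
    pad = (-n) % k                          # zero-pad to a multiple of k,
    prods = [x * y for x, y in zip(a, b)]   # mirroring the inserted 0s
    prods += [0] * pad
    # adder segment i accumulates products i, i+k, i+2k, ...
    partials = [sum(prods[i::k]) for i in range(k)]
    return sum(partials)                    # final combination of partial sums

a = [1, 2, 3, 4, 5, 6, 7]
b = [2, 2, 2, 2, 2, 2, 2]
print(inner_product(a, b))                  # same as sum(ai * bi)
```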


MEMORY INTERLEAVING
 Pipeline and vector processors often require
simultaneous access to memory from two or more
sources.
 An instruction pipeline may require the fetching of
an instruction and an operand at the same time from
two different segments.
 An arithmetic pipeline usually requires two or more
operands to enter the pipeline at the same time.
 Instead of using two memory buses for
simultaneous access, the memory can be partitioned
into a number of modules connected to common
memory address and data buses.

MEMORY INTERLEAVING (Contd…)


 A memory module is a memory array together with its own address and data registers.
 The advantage of a modular memory is that it allows the use of a technique called interleaving.
 In an interleaved memory, different sets of addresses are assigned to different memory modules.

[Diagram: four memory modules M0–M3 on a common address bus and data bus; each module has its own address register (AR), memory array, and data register (DR)]

 Different sets of addresses are assigned to different memory modules.
 For example, in a two-module memory system, the even addresses may be in one module and the odd addresses in the other.
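The address mapping can be sketched as follows (an illustration assuming the low-order address bits select the module, so consecutive addresses fall in different modules and can be accessed in parallel):

```python
def module_of(addr, m=4):
    """Module number: low-order part of the address."""
    return addr % m

def word_of(addr, m=4):
    """Word index within the selected module."""
    return addr // m

# Consecutive addresses cycle through the four modules
for addr in range(8):
    print(addr, "-> module", module_of(addr), "word", word_of(addr))

# Two-module case from the slide: even addresses in one module,
# odd addresses in the other
print(module_of(6, m=2), module_of(7, m=2))
```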
Array Processor

Array processor
 An array processor is a processor that is employed to
compute on large arrays of data. The term is used to
refer to two different types of processors.
 An attached array processor
 SIMD Array Processor
 Both are used to manipulate vectors, but they differ in
internal organizations.

An attached array processor


 It is an auxiliary processor attached to a general-purpose computer.
 Its purpose is to enhance the performance of the host computer by providing vector processing for complex scientific applications.
 Parallel processing is achieved by multiple functional units.

[Diagram: the general-purpose computer (with main memory on the memory bus) connects through an input-output interface to the attached array processor (with its own local memory); a high-speed memory-to-memory bus links main memory and local memory]

SIMD Array Processor


 It is a Single Instruction Multiple Data organization.
 It manipulates vector instructions by means of multiple functional units responding to a common instruction under the control of a common control unit.

[Diagram: a master control unit, connected to main memory, broadcasts instructions to processing elements PE1…PEn, each with its own local memory M1…Mn]

SIMD Array Processor (Contd…)

 Each processing element (PE) includes an ALU, a floating-point arithmetic unit, and working registers.
 The main memory is used for storage of the program.
 The function of the master control unit is to decode the instructions and determine how each instruction is to be executed. It controls the operations in the PEs.
 Scalar and program control instructions are directly executed in the master control unit.
 Vector instructions are broadcast to all PEs simultaneously.
THANK YOU
