
Rehan Azmat

Lecture 26-27
SIMD Architecture

Introduction and Motivation


Architecture classification
Performance of Parallel Architectures
Interconnection Network

Array processors
Vector processors
Cray X1
Multimedia extensions

Manipulation of arrays or vectors is a common operation in scientific and engineering applications.

Typical operations on array-oriented data include:

Processing one or more vectors to produce a scalar result.
Combining two vectors to produce a third one.
Combining a scalar and a vector to generate a vector.
A combination of the above three operations.
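As an illustration, these operation types can be sketched in plain Python (on a real vector machine, each would compile down to one or a few vector instructions):

```python
A = [1.0, 2.0, 3.0, 4.0]
B = [5.0, 6.0, 7.0, 8.0]
k = 2.0

# Vectors -> scalar result (here: a dot product)
s = sum(a * b for a, b in zip(A, B))   # 1*5 + 2*6 + 3*7 + 4*8 = 70.0

# Two vectors -> a third vector (element-wise addition)
C = [a + b for a, b in zip(A, B)]      # [6.0, 8.0, 10.0, 12.0]

# Scalar and vector -> vector
D = [k * a for a in A]                 # [2.0, 4.0, 6.0, 8.0]
```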

Two architectures suitable for vector processing have evolved:

Pipelined vector processors
Implemented in many supercomputers

Parallel array processors

The compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to.
Data parallelism.

Strictly speaking, vector processors are not parallel processors.

There are not several CPUs in a vector processor running in parallel.
They are SISD processors with vector instructions executed on pipelined functional units.
They only behave like SIMD computers.

Vector computers usually have vector registers, each of which can store 64 up to 128 words.

Vector instruction examples:

Load a vector from memory into a vector register
Store a vector into memory
Arithmetic and logic operations between vectors
Operations between vectors and scalars

Programmers are allowed to use operations on vectors in their programs, and the compiler translates these operations into vector instructions at machine level.

A vector unit typically consists of:

pipelined functional units
vector registers

Vector registers:

n general-purpose vector registers Ri, 0 ≤ i ≤ n-1;

vector length register VL: stores the length l (0 ≤ l ≤ s) of the currently processed vector; s is the length of the vector registers;

mask register M: stores a set of l bits, one for each element in a vector, interpreted as Boolean values; vector instructions can be executed in masked mode, so that vector register elements corresponding to a false value in M are ignored.
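Masked execution can be sketched as follows (plain Python with an illustrative function name; real hardware applies the mask inside a single vector instruction):

```python
def masked_add(vr1, vr2, mask):
    """Masked vector add: lanes whose mask bit is False are ignored,
    i.e. they keep vr1's original element."""
    return [a + b if m else a for a, b, m in zip(vr1, vr2, mask)]

# Only lanes 0 and 2 are active:
result = masked_add([1, 2, 3, 4], [10, 10, 10, 10],
                    [True, False, True, False])   # [11, 2, 13, 4]
```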

Consider an element-by-element addition of two N-element vectors A and B to create the sum vector C.

On an SISD machine, this computation will be implemented as:


for i = 0 to N-1 do
C[i] := A[i] + B[i];
There will be N*K instruction fetches (K instructions are needed for each iteration)
and N additions.
There will also be N conditional branches, if loop unrolling is not used.

A compiler for a vector computer generates something like:

C[0:N-1] ← A[0:N-1] + B[0:N-1];

Even though N additions will still be performed, there will only be K instruction fetches (e.g., Load A, Load B, Add, Write C = 4 instructions).
No conditional branch is needed.
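The contrast can be sketched in plain Python, with the whole-array form standing in for the single machine-level vector instruction:

```python
N = 8
A = list(range(N))          # [0, 1, ..., 7]
B = list(range(N, 2 * N))   # [8, 9, ..., 15]

# SISD style: one fetch/decode round and a branch test per iteration
C = [0] * N
for i in range(N):
    C[i] = A[i] + B[i]

# Vector style: one whole-array statement, like C[0:N-1] <- A[0:N-1] + B[0:N-1]
C_vec = [a + b for a, b in zip(A, B)]

assert C == C_vec   # same result, far fewer instruction fetches on a vector machine
```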

Advantages:

Quick fetch and decode of a single instruction for multiple operations.
The instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion.
The compiler does the work for you.

Memory-to-memory operation mode:

no registers;
can process very long vectors, but start-up time is large;
appeared in the 70s and died in the 80s.

Register-to-register operations are more common with new machines.

It is composed of N identical processing elements under the control of a single control unit and a number of memory modules.
The PEs execute instructions in lock-step mode.

Processing units and memory elements communicate with each other through an interconnection network.
Different topologies can be used.

The complexity of the control unit is at the same level as that of a uniprocessor system.

The control unit is a computer with its own high-speed registers, local memory and arithmetic logic unit.

The main memory is the aggregate of the memory modules.

Processing element complexity

Single-bit processors
Connection Machine (CM-2): 65,536 PEs connected by a hypercube network (by Thinking Machines Corporation).

Multi-bit processors
ILLIAC IV (64-bit), MasPar MP-1 (32-bit)

Processor-memory interconnection

Dedicated memory organization
ILLIAC IV, CM-2, MP-1

Global memory organization
Bulk Synchronous Parallel (BSP) computer

Control and scalar-type instructions are executed in the control unit.

Vector instructions are performed in the processing elements.

Data structuring and detection of parallelism in a program are the major issues in the application of array processors.

Operations such as C(i) = A(i) × B(i), 1 ≤ i ≤ n, could be executed in parallel if the elements of the arrays A and B are distributed properly among the processors or memory modules.
Ex.: PEi is assigned the task of computing C(i).

To compute an inner product s = A(1)B(1) + A(2)B(2) + ... + A(N)B(N), assuming:
A dedicated memory organization.
Elements of A and B are properly and perfectly distributed among processors (the compiler can help here).
We have:
The product terms are generated in parallel (one step on N PEs).
The additions can be performed in log2 N iterations (pairwise tree reduction).

The speed-up factor (S) over the 2N-1 sequential operations is:
S = (2N-1) / (1 + log2 N)
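The log2 N pairwise reduction can be sketched as follows (plain Python; on the array processor, each addition step within the loop runs in parallel across the PEs):

```python
def tree_sum(values):
    """Pairwise (tree) reduction: combines N values in ceil(log2 N) steps,
    the way N PEs would add their partial results in parallel."""
    work = list(values)
    steps = 0
    while len(work) > 1:
        # Each active PE pair adds its two values in the same step
        work = [work[i] + work[i + 1] if i + 1 < len(work) else work[i]
                for i in range(0, len(work), 2)]
        steps += 1
    return work[0], steps

A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [8, 7, 6, 5, 4, 3, 2, 1]
products = [a * b for a, b in zip(A, B)]  # generated in one parallel step
total, steps = tree_sum(products)         # total == 120, steps == 3 (= log2 8)
```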

ILLIAC IV development started in the late 60s; fully operational in 1975.

SIMD computer for array processing.

Control Unit + 64 Processing Elements.

The CU can access all memory.

The PEs can access local memory and communicate with neighbors.

The CU reads the program and broadcasts instructions to the PEs.

2K words of memory per PE.

Cray combines several technologies in the X1 (2003):
12.8 Gflop/s vector processors
Shared caches
4-processor nodes sharing up to 64 GB of memory

Multi-streaming vector processing
Multiple-node architecture

MSP: Multi-Streaming vector Processor
Formed by 4 SSPs (each a 2-pipe vector processor)
Balance computations across SSPs.
The compiler will try to vectorize/parallelize across the MSP, achieving streaming.

Many levels of parallelism

Some are automated by the compiler, some require work by the programmer:

Within a processor: vectorization
Within an MSP: streaming
Within a node: shared memory
Across nodes: message passing

This is a common trend: the more complex the architecture, the more difficult it is for the programmer to exploit it.

Hard to fit this machine into a simple taxonomy!

How do we extend general-purpose microprocessors so that they can handle multimedia applications efficiently?

Analysis of the need:

Video and audio applications very often deal with large arrays of small data types (8 or 16 bits).
Such applications exhibit a large potential for SIMD (vector) parallelism.
Data parallelism.

Solutions:

New generations of general-purpose microprocessors are equipped with special instructions to exploit this parallelism.

The specialized multimedia instructions perform vector computations on bytes, half-words, or words.

Several vendors have extended the instruction set of their processors in order to improve performance with multimedia applications:

MMX for the Intel x86 family
VIS for UltraSparc
MDMX for MIPS
MAX-2 for Hewlett-Packard PA-RISC

The Pentium line provides 57 MMX instructions. They treat data in a SIMD fashion to improve the performance of:

Computer-aided design
Internet applications
Computer visualization
Video games
Speech recognition

The basic idea: sub-word execution

Use the entire width of a processor data path (32 or 64 bits), even when processing the small data types used in signal processing (8, 12, or 16 bits).

With a word size of 64 bits, one adder can be used to implement eight 8-bit additions in parallel.
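The trick is to keep carries from crossing the 8-bit lane boundaries. A software sketch of the same idea (SWAR arithmetic on a 64-bit integer in plain Python; real MMX hardware does this in a single instruction, and the function name here only echoes the MMX mnemonic):

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
H = 0x8080808080808080  # the high bit of each of the eight byte lanes

def paddb(a, b):
    """Eight independent 8-bit additions (with wrap-around) inside one
    64-bit word: add the low 7 bits of each lane, then patch the high
    bits back in with XOR so no carry leaks into the neighbouring lane."""
    low = ((a & ~H) + (b & ~H)) & MASK64
    return (low ^ ((a ^ b) & H)) & MASK64

a = 0x01020304050607FF
b = 0x0101010101010102
# Per lane: 01+01, 02+01, ..., FF+02 (wraps to 01), and the wrap-around
# does not disturb the neighbouring lane:
assert paddb(a, b) == 0x0203040506070801
```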

MMX technology allows a single instruction to work on multiple pieces of data.

Consequently, we practically have a kind of SIMD parallelism, at a reduced scale.

Three packed data types are defined for parallel operations: packed byte, packed word, packed double word.
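What "packed byte" means can be sketched in plain Python (the helper names are illustrative, not MMX mnemonics): eight 8-bit lanes live side by side in one 64-bit word.

```python
def pack_bytes(lanes):
    """Pack eight 8-bit values into one 64-bit 'packed byte' word,
    lanes[0] ending up in the most significant byte."""
    assert len(lanes) == 8 and all(0 <= v < 256 for v in lanes)
    word = 0
    for v in lanes:
        word = (word << 8) | v
    return word

def unpack_bytes(word):
    """Split a 64-bit word back into its eight byte lanes."""
    return [(word >> shift) & 0xFF for shift in range(56, -1, -8)]

assert unpack_bytes(pack_bytes([1, 2, 3, 4, 5, 6, 7, 8])) == [1, 2, 3, 4, 5, 6, 7, 8]
```

Packed word (four 16-bit lanes) and packed double word (two 32-bit lanes) follow the same pattern with wider shifts.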

Benchmark tables show the performance of Pentium processors with and without MMX technology.

Vector processors are SISD processors whose instruction set includes operations on vectors.
They are implemented using pipelined functional units.
They behave like SIMD machines.

Array processors, being typical SIMD machines, execute the same operation on a set of interconnected processing units.

Both vector and array processors are specialized for numerical problems expressed in matrix or vector formats.

Many modern architectures deploy several parallel-architecture concepts at the same time, such as the Cray X1.

Multimedia applications exhibit a large potential for SIMD parallelism.

The instruction set of modern microprocessors has been extended to support SIMD-style parallelism with operations on short vectors.

End of the Lecture
