
Parallel Processors

Session 2
Multiprocessor and Multicomputer Models
Speedup
For a fixed problem size (input data set):

Performance = 1 / Time

Speedup (p processors) = Performance (p processors) / Performance (1 processor)

Speedup (p processors) = Time (1 processor) / Time (p processors)
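As a quick illustration (not from the slides), a minimal sketch that computes speedup from measured run times; the timing values below are hypothetical:

    # Speedup for a fixed problem size, computed from measured run times.
    # The timing numbers are hypothetical, for illustration only.

    def speedup(time_1proc: float, time_pproc: float) -> float:
        """Speedup(p) = Time(1 processor) / Time(p processors)."""
        return time_1proc / time_pproc

    t1 = 120.0   # seconds on 1 processor (hypothetical)
    t8 = 18.0    # seconds on 8 processors (hypothetical)

    print(f"Speedup on 8 processors: {speedup(t1, t8):.2f}x")  # -> 6.67x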
Parallel Computers (MIMD)
• Two major classes:
1. Shared-memory multiprocessors
2. Message-passing multicomputers
• Major difference:
– Memory sharing
– Interprocessor communication mechanism
Memory Sharing
• Multiprocessor systems:
– All processors have access to a common
memory
• Multicomputer systems:
– Each computer node has a local memory
which is not shared with other nodes
Processor Communication
• Multiprocessor systems:
– Processors communicate through shared
variables in a common memory
• Multicomputer systems:
– Processors communicate through message
passing among the nodes
Shared-Memory Multiprocessors
• Three models of shared-memory
multiprocessors:
1. Uniform Memory Access (UMA)
2. Nonuniform Memory Access (NUMA)
3. Cache-Only Memory Architecture (COMA)
The UMA Model
[Figure: processors P1 … Pn connected through a system interconnect (bus, crossbar, or multistage network) to shared-memory modules SM1 … SMm and I/O]

• The physical memory is uniformly shared by all the processors
• All processors have equal access time to all memory words
• Each processor may use a private cache
• To coordinate parallel events, synchronization and communication between processors are done through shared variables in the common memory (see the sketch below)
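A minimal sketch of coordination through a shared variable, in the spirit of a UMA multiprocessor. Python threads and a lock stand in for processors and hardware synchronization; this is an illustrative assumption, not part of the slides:

    # Sketch: synchronization through a shared variable in common memory.
    # Threads stand in for processors on a UMA machine.
    import threading

    counter = 0                  # shared variable in "common memory"
    lock = threading.Lock()      # coordinates parallel events

    def worker(n_increments: int) -> None:
        global counter
        for _ in range(n_increments):
            with lock:           # only one "processor" updates at a time
                counter += 1

    threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)  # 40000 -- deterministic because access is synchronized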
More on the UMA Model
• Symmetric multiprocessors: all processors have equal access to all peripherals, and any processor can act as the executive (master) processor and run the OS kernel and I/O service routines
• Asymmetric multiprocessors: only one or some of the processors are executive-capable; the other processors are called attached processors (APs)
• Attached processors execute user code under the supervision of the master processor (MP)
• In both MP and AP configurations, memory sharing among master and attached processors is still in place
The NUMA Model

[Figure: local memories LM 1 … LM n, each attached to a processor P1 … Pn, all joined by an interconnection network]

• The shared memory is physically distributed to all processors as local memories
• The collection of all local memories forms a global address space
accessible by all processors
• The memory access time varies with the location of the memory word
• A processor accesses its own local memory fastest
• Accessing remote memory attached to another processor takes longer
More on the NUMA Model
• Globally shared memory can be added to
a multiprocessor system
• There are three memory access
patterns:
1. Local memory access, the fastest
2. Global memory access
3. Remote memory access, the slowest
A Hierarchical Hybrid Model
[Figure: clusters 1 … n, each containing processors (P) and cluster shared memories (CSM) on a cluster interconnect network (CIN); the clusters and the global shared-memory modules (GSM) are connected by the global interconnect network (GIN)]

• Processors are divided into several clusters
• Each cluster is itself a UMA or NUMA multiprocessor
• The clusters are connected to global shared memory modules
• The entire system is considered a NUMA multiprocessor
• All clusters have equal access to global memory
• The access time to global memory is longer than the access time to cluster memory
The COMA Model
[Figure: processor nodes, each a processor (P) with a cache (C) and a cache directory (D), connected by an interconnection network]

• In this model the multiprocessor uses cache-only memory
• Special case of the NUMA model in which the distributed main memories are converted to caches
• There is no memory hierarchy at each processor node
• All the caches form a global address space
• Remote cache access is assisted by the distributed cache directories
Scalability
• Multiprocessor systems are suitable for general-
purpose multiuser applications
• Programmability is the major concern in this
case
• Multiprocessors are not easily scalable
• A multiprocessor with centralized shared
memory is complex
• In the distributed case, the latency of accessing remote data is a major limitation
Distributed-Memory Multicomputers

[Figure: processor–memory nodes (P, M) connected by a message-passing interconnection network]

• Multiple computers (nodes)
• Interconnected by a message-passing network
• Each node is an autonomous computer consisting of a processor, local memory, and attached disks and I/O peripherals
Distributed-Memory Multicomputers

• The message-passing network provides point-to-point static connections among the nodes
• All local memories are private and are accessible only by local processors
• Also called no-remote-memory-access (NORMA) machines
• Internode communication is carried out by passing messages through the static connection network (see the sketch below)
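A minimal sketch of NORMA-style message passing. Python processes with a Pipe stand in for two nodes and a static point-to-point link; this is an illustrative assumption, not how real multicomputers are programmed:

    # Sketch: internode communication by message passing (NORMA style).
    # Each process has private memory; the Pipe stands in for a static
    # point-to-point link between two nodes.
    from multiprocessing import Process, Pipe

    def node_a(conn):
        local_data = [1, 2, 3]                   # private local memory
        conn.send(('sum_request', local_data))   # message to node B
        print('node A received:', conn.recv())

    def node_b(conn):
        tag, payload = conn.recv()   # no shared variables: data arrives in a message
        conn.send(sum(payload))

    if __name__ == '__main__':
        a_end, b_end = Pipe()
        pa = Process(target=node_a, args=(a_end,))
        pb = Process(target=node_b, args=(b_end,))
        pa.start(); pb.start(); pa.join(); pb.join()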
Multicomputer Generations
• First Generation (1983-1987)
– Board technology
– Hypercube architecture
– Software-controlled message switching
• Second Generation (1988-1992)
– Mesh-connected architecture
– Hardware message routing
– Software environment for distributed computing
• Third Generation
– Processor and communication hardware on the same VLSI chip
– Hardware routers to pass messages
– A computer node is attached to each router
– The boundary router may be attached to I/O and peripheral devices
– Mixed types of nodes are allowed in a heterogeneous multicomputer
– Internode communications are achieved through compatible data
representations and message-passing protocols
Important Issues
• Static network topology (ring, tree, mesh,
hypercube …)
• Communication patterns (one-to-one,
broadcasting, permutations, multicast)
• Message routing schemes
• Network flow control strategies
• Deadlock avoidance
• Virtual channels
• Message-passing primitives
• Program decomposition techniques
Programmability Issues
• The need to distribute computations and data sets across nodes, and to establish efficient internode communication through message passing, makes programming difficult
• Intelligent compilers and efficient
distributed operating systems are needed
Multivector and SIMD Computers
• Pipelined vector machines using a few
powerful processors equipped with vector
hardware
• SIMD computers with massive data
parallelism
Vector Supercomputers
[Figure: a host computer loads program and data from mass storage into main memory; a scalar processor executes scalar instructions, while vector instructions and vector data flow to an attached vector processor; I/O is handled through the host]

• A vector computer is built on top of a scalar processor
• The vector processor is attached to the scalar processor as an optional feature
Vector and Scalar Operations
[Figure: the scalar processor contains a scalar control unit and scalar functional pipelines; the vector processor contains a vector control unit, vector registers, and vector functional pipelines; instructions and data come from the main memory holding program and data]
• Program and data are first loaded into the main memory through a host computer
• All instructions are first decoded by the scalar control unit
• If the instruction is a scalar operation it will be executed by the scalar processor
• If the instruction is decoded as a vector operation, it will be sent to the vector control unit
• The vector control unit will supervise the flow of vector data between main memory and vector
functional pipelines
• A number of vector functional pipeline units may be built into a vector processor (the dispatch flow is sketched below)
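A minimal sketch of the decode-and-dispatch flow described above. The instruction encoding and unit interfaces are invented for illustration:

    # Sketch: instruction dispatch in a vector supercomputer.
    # All instructions are first decoded; scalar operations go to the
    # scalar pipelines, vector operations to the vector control unit.

    def scalar_unit(op, operands):
        print(f"scalar pipeline executes {op}{operands}")

    def vector_control_unit(op, operands):
        # supervises the flow of vector data between memory and pipelines
        print(f"vector pipelines stream {op} over {operands}")

    def decode_and_dispatch(instr):
        op, operands = instr
        if op.startswith('V'):        # decoded as a vector operation
            vector_control_unit(op, operands)
        else:                         # scalar operation
            scalar_unit(op, operands)

    program = [('ADD', ('r1', 'r2')), ('VADD', ('v1', 'v2')), ('VMUL', ('v3', 'v1'))]
    for instr in program:             # every instruction hits the decoder first
        decode_and_dispatch(instr)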
Vector Processor Models

• Register-to-Register Vector Processors
• Memory-to-Memory Vector Processors
Register-to-Register Architecture
[Figure: a register-to-register vector processor: the vector control unit receives vector instructions; vector registers hold vector data and feed the vector functional pipelines, which write results back into the registers]

• Vector registers are used to hold vector operands and intermediate and final
results
• The vector functional pipelines retrieve operands from the vector registers
• Results are written back into the vector registers by the vector functional
pipelines
• The length of each vector register is usually fixed
• In some cases the length is reconfigurable
• In general there is a fixed number of vector registers and functional pipelines in a vector processor
• Both resources must be reserved in advance to avoid resource conflicts between different vector operations (see the sketch below)
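A minimal sketch of a register-to-register vector add with fixed-length registers. The register length of 64 is a common choice but an assumption here; splitting longer vectors into register-sized chunks ("strip mining") is an added detail, not from the slides:

    # Sketch: register-to-register vector add. Operands are loaded into
    # fixed-length vector registers, the functional pipeline reads from
    # and writes back to registers, and results are stored explicitly.
    VLEN = 64

    def vload(memory, base):
        return memory[base:base + VLEN]          # fill a vector register

    def vadd(vreg_a, vreg_b):
        # functional pipeline: operands from registers, result to a register
        return [a + b for a, b in zip(vreg_a, vreg_b)]

    mem = list(range(256))
    v1 = vload(mem, 0)
    v2 = vload(mem, 64)
    v3 = vadd(v1, v2)                            # result stays in a register
    mem[128:128 + VLEN] = v3                     # explicit store back to memory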
Memory-to-Memory Architecture
• A vector stream unit replaces the vector
registers
• Vector operands and results are directly
retrieved from the main memory in
superwords (for example 512 bits)
SIMD Supercomputers
[Figure: a control unit drives N processing elements PE 0 … PE N-1, each pairing a processor (Proc.) with a local memory (Mem.); the PEs communicate through an interconnection network]
An operational model of an SIMD supercomputer is specified by:

1. The number of processing elements (PEs), for example 64, 65536, …
2. The set of instructions directly executed by the control unit (scalar and program-flow-control instructions)
3. The set of instructions broadcast by the CU to all PEs for parallel execution (arithmetic, logic, data-routing, and other local operations executed by each active PE over data within that PE)
4. The set of masking schemes (each mask partitions the set of PEs into enabled and disabled subsets); masked execution is sketched after this list
5. The set of data-routing functions (specifying various patterns to be set up in the interconnection network for inter-PE communications)
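A minimal sketch of one broadcast step with masking. The CU broadcasts a single instruction and only enabled PEs apply it to their local data; the representation is invented for illustration:

    # Sketch: one SIMD broadcast step with a masking scheme.
    # Each list element models one PE's local memory word.

    def broadcast(instruction, pe_data, mask):
        """Apply `instruction` in every PE whose mask bit is set."""
        return [instruction(x) if enabled else x
                for x, enabled in zip(pe_data, mask)]

    pe_memory = [3, 1, 4, 1, 5, 9, 2, 6]     # one word per PE
    mask = [x > 2 for x in pe_memory]        # enabled/disabled subsets

    pe_memory = broadcast(lambda x: x * 10, pe_memory, mask)
    print(pe_memory)  # [30, 1, 40, 1, 50, 90, 2, 60] -- disabled PEs unchanged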
Example: MasPar MP-1
1. The MP-1 is an SIMD machine with N = 1024 to 16384 PEs, depending on the configuration
2. The CU executes scalar instructions, broadcasts decoded vector
instructions to the PE array, and controls the inter-PE
communications
3. Each PE is a register-based load/store RISC processor capable of
executing integer operations over various data sizes and standard
floating-point operations. The PE receives the instructions from
the CU
4. The masking scheme is built within each PE and continuously
monitored by the CU which can set and reset the status of each
PE dynamically at run time
5. The interconnection network is a mesh network plus a global
multistage crossbar router for inter-CU-PE communications
Parallel Computer Models
• Theoretical models are derived from
physical models
• These models are used by algorithm designers to avoid implementation details
• The models can be used to obtain
theoretical performance bounds
PRAM
RAM: Random Access Machine
• Model for conventional uniprocessor computers
PRAM: Parallel Random Access Machine
• Model for idealized parallel computers
– Zero synchronization overhead
– Zero memory access overhead
• This model is used for
– Parallel algorithm development
– Scalability analysis
– Complexity analysis
PRAM Model
PRAM with n processors:

[Figure: processors P1, P2, …, Pn, all totally synchronized, connected to a shared memory]

• Globally addressable memory
• The shared memory can be:
– Centralized in one place
– Distributed among processors
• Processors operate in a cycle:
– Read memory
– Compute
– Write memory
• The operation is synchronized (sketched below)
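A minimal sketch of the synchronized read-compute-write cycle, under the PRAM idealization of zero synchronization and memory-access overhead. The example computation (each processor adds its right neighbor's value) is invented for illustration:

    # Sketch: one PRAM cycle. All processors read in the same cycle,
    # then compute, then write -- so every read sees the old values.

    def pram_step(shared, compute):
        n = len(shared)
        reads = [shared[i] for i in range(n)]            # 1. all read
        results = [compute(i, reads) for i in range(n)]  # 2. all compute
        for i in range(n):                               # 3. all write
            shared[i] = results[i]

    shared_memory = [1, 2, 3, 4]
    # processor i adds its right neighbor's value (wrapping around)
    pram_step(shared_memory, lambda i, mem: mem[i] + mem[(i + 1) % len(mem)])
    print(shared_memory)  # [3, 5, 7, 5] -- reads saw the pre-cycle values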
Shared Memory Access Methods in
PRAM Models
How are concurrent reads and concurrent writes of memory handled?
Four options:
• ER: Exclusive Read
– At most one processor can read from a memory location in a cycle
• EW: Exclusive Write
– At most one processor can write into a memory location in a cycle
• CR: Concurrent Read
– Multiple processors can read the same information from the same
memory location in one cycle
• CW: Concurrent Write
– Multiple processors can write to the same memory location in one
cycle
– A policy is needed to resolve the write conflicts
PRAM Variants
• EREW-PRAM Model:
– Exclusive Read, Exclusive Write
• Forbids more than one processor from reading or
writing the same memory cell simultaneously
• CREW-PRAM Model:
– Concurrent Read, Exclusive Write
• ERCW-PRAM Model:
– Exclusive Read, Concurrent Write
• CRCW-PRAM Model:
– Concurrent Read, Concurrent Write
Concurrent Write Conflicts
• Policies to resolve conflicting writes (sketched below):
• Common:
– All simultaneous writes store the same value to the hot-spot memory location
• Arbitrary:
– Any one of the values written may remain; the others are ignored
• Minimum:
– The value written by the processor with the minimum index remains
• Priority:
– The values being written are combined using some associative function, such as summation or maximum
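A minimal sketch of the four policies applied to one hot-spot location. The (processor index, value) representation is invented for illustration, and summation stands in for the associative combining function:

    # Sketch: concurrent-write conflict resolution. `writes` holds
    # (processor_index, value) pairs issued in the same cycle.
    import random

    def resolve(writes, policy):
        if policy == 'common':
            values = {v for _, v in writes}
            assert len(values) == 1, "common policy: all values must agree"
            return values.pop()
        if policy == 'arbitrary':
            return random.choice(writes)[1]  # any one value may remain
        if policy == 'minimum':
            return min(writes)[1]            # lowest processor index wins
        if policy == 'priority':             # combine with an associative fn
            return sum(v for _, v in writes) # e.g. summation (or max)
        raise ValueError(policy)

    writes = [(0, 5), (2, 7), (3, 1)]
    for p in ('arbitrary', 'minimum', 'priority'):
        print(p, '->', resolve(writes, p))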
