
Parallel Processing

Spring 2016, Lecture #7
Dr M Shamim Baig

Problems:
Explicit Parallel Architectures


Example Problem 1:
Bus-based SM-Multiprocessor:
Limit of Parallelism
Consider an SM-Multiprocessor using
32-bit RISC processors running at 150
MHz, each carrying out one instruction
per clock cycle. Assume 15% of
instructions are data loads & 10% are
data stores, using a shared data bus
with 2 GB/sec BW.
Compute the max number of processors
that can be connected on the above bus
for the following parallel configurations:

Example Problem 1:
Bus-based SM-Multiprocessor:
Limit of Parallelism (contd.)
(a) SMP (without cache memory)
(b) SMP with cache memory
having a hit ratio of 95% &
a memory write-through policy
(c) NUMA with program locality
factor = 80%
(see the worked sketch below)
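One possible worked sketch (my own illustration, not from the slides), assuming instruction fetches do not use the shared bus, 1 GB/sec = 1e9 bytes/sec, and that under write-through every store reaches the bus while only load misses do:

CLOCK_HZ = 150e6          # 150 MHz, one instruction per clock cycle
WORD_BYTES = 4            # 32-bit data words
LOAD_FRAC, STORE_FRAC = 0.15, 0.10
BUS_BW = 2e9              # 2 GB/sec shared data bus (assumed 1 GB = 1e9 B)

def max_processors(bus_bytes_per_instr):
    # Each processor issues CLOCK_HZ instructions/sec; the bus saturates
    # when the combined data traffic of all processors reaches BUS_BW.
    return int(BUS_BW // (CLOCK_HZ * bus_bytes_per_instr))

# (a) No caches: every load & store crosses the bus.
print(max_processors((LOAD_FRAC + STORE_FRAC) * WORD_BYTES))         # -> 13

# (b) 95% hit ratio, write-through: 5% of loads miss, all stores go out.
print(max_processors((LOAD_FRAC * 0.05 + STORE_FRAC) * WORD_BYTES))  # -> 31

# (c) NUMA, 80% locality: only the 20% remote accesses use the bus.
print(max_processors((LOAD_FRAC + STORE_FRAC) * 0.20 * WORD_BYTES))  # -> 66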

SMP (Shared Memory & Shared-Bus IN)

Bus-based interconnects: (a) with no local caches; (b) with local memory/caches.
Since much of the data accessed by processors is local to the processor, a
local memory can improve the performance of bus-based machines. Example?

UMA & NUMA Arch Block Diagrams

Both are SM-multiprocessors,
differing in their memory-access
delay profile.

UMA (CSM + SAS)

NUMA (DM + SAS = DSM)

Typical shared-address-space architectures: (a) Uniform-memory-access shared-address-space
computer; (b) Uniform-memory-access shared-address-space computer with caches and memories;
(c) Non-uniform-memory-access shared-address-space computer with local memory only.
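Since the UMA/NUMA distinction is precisely a difference in memory-access delay, a weighted-average access-time model makes it concrete. A minimal sketch; the 4-cycle & 40-cycle latencies below are illustrative assumptions, not slide data:

def numa_avg_access(t_local, t_remote, locality):
    # Average memory-access time under a given program locality factor.
    return locality * t_local + (1 - locality) * t_remote

# e.g. the 80% locality factor of Example Problem 1(c), with assumed
# 4-cycle local & 40-cycle remote access latencies:
print(numa_avg_access(4, 40, 0.80))   # -> 11.2 cycles on average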

Homework:
Self-assessed problems
Please mark your solutions & note
the marks you achieved.

Example Problem 2:
Message-Passing Multicomputer:
Local vs Remote Memory Data Access Delays
Consider a 64-node multicomputer; each node comprises a
32-bit RISC processor with a 250 MHz clock rate & 8 MB of
local memory. A local memory access requires 4 clock
cycles, the remote-comm initiation (setup) overhead is 15 clock
cycles & the interconnection-network BW is 80 MB/sec.
The total number of instructions executed is 200,000.
If memory data loads & stores are 15% & 10% respectively
of the instructions, compute:
(a) Load/store time if all accesses are to local nodes
(b) Load/store time if 20% of accesses are to remote nodes
note: Assume packet lengths are variable (depending on addr
& data bytes) & the communication protocol as given (SCP???).
(note: the size of message-packet fields is in multiples of bytes)
(a worked sketch follows below)
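A hedged worked sketch (my own; the packet format of the referenced protocol is not fully specified here), assuming each remote access sends 4 address bytes & receives 4 data bytes (8 bytes on the wire) and also pays the 4-cycle memory access at the remote node:

CLOCK_HZ = 250e6                  # 250 MHz
CYCLE_S = 1 / CLOCK_HZ            # 4 ns per clock cycle
N_INSTR = 200_000
ACCESS_FRAC = 0.15 + 0.10         # loads + stores = 25% of instructions
LOCAL_CYCLES = 4
SETUP_CYCLES = 15
NET_BW = 80e6                     # 80 MB/sec network bandwidth
PACKET_BYTES = 8                  # assumption: 4 addr bytes + 4 data bytes

accesses = N_INSTR * ACCESS_FRAC  # 50,000 memory accesses in total

# (a) All accesses local: 50,000 accesses * 4 cycles * 4 ns each.
t_a = accesses * LOCAL_CYCLES * CYCLE_S

# (b) 20% remote: setup + wire time + remote memory access, per access.
remote = 0.20 * accesses
wire_s = PACKET_BYTES / NET_BW    # 0.1 us on the wire per remote access
t_remote_each = SETUP_CYCLES * CYCLE_S + wire_s + LOCAL_CYCLES * CYCLE_S
t_b = (accesses - remote) * LOCAL_CYCLES * CYCLE_S + remote * t_remote_each

print(f"(a) {t_a * 1e6:.0f} us, (b) {t_b * 1e6:.0f} us")  # -> 800 us, 2400 us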

Example Problem 2 (contd.):
Message-Passing Multicomputer:
Local vs Remote Memory Data Access Delays

[Figure: computers (processor + local memory) exchanging messages over an interconnection network.]

Interconnection Networks (INs)
for Parallel Computers

Interconnection Networks (INs)
for Parallel Computers
INs carry data/synch info between
processors/memory.
INs are made of switches/ports/links.
INs are classified as:
o Shared Buses
o Static INs
o Dynamic INs

Shared-Bus-based IN

All processors share a common bus for
exchanging data & synch info.
However, the bandwidth of the shared bus
is a major bottleneck for the scalability of
parallelism.
Simple & small parallel machines (eg SMPs)
use a bus. Typical size is limited to dozens of
nodes.
Sun Enterprise servers & systems based on
Intel multicore processors use a bus as the IN.

Static vs Dynamic
Interconnection Networks (INs)
Static INs consist of point-to-point
communication links among
processing nodes and are also
referred to as direct networks.
Dynamic INs are built using switches
& communication links. Dynamic
networks are also referred to as
indirect networks.

Static vs Dynamic
Interconnection Networks (INs)

Classification of interconnection networks:
(a) a static network; and (b) a dynamic network.

Static Interconnection
Networks (INs)


Static INs: Topologies

A variety of static network topologies
have been proposed & implemented.
These topologies trade off performance
against cost.
Commercial machines often implement
hybrids of multiple topologies for reasons
of packaging, cost & available components.

Static INs: Topologies

Completely Connected
Star
Tree
Linear Array & Ring
2-D/ 3-D Mesh & Torus
Hypercube
k-d Mesh

Static INs:
Evaluation Parameters
Degree: max number of links connected
at any node of the network.
Diameter: distance (shortest path)
between the farthest nodes in the
network.
Link Cost: total number of links
required to implement the network.
(see the sketch below)
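As a sketch of how these parameters behave, here they are as closed-form functions of the node count p for a few of the topologies listed earlier (standard formulas; p is assumed to be a power of two for the hypercube):

import math

def params(topology, p):
    """Return (degree, diameter, link cost) for a p-node network."""
    if topology == "complete":
        return p - 1, 1, p * (p - 1) // 2
    if topology == "star":
        return p - 1, 2, p - 1       # degree is that of the central node
    if topology == "ring":
        return 2, p // 2, p
    if topology == "hypercube":      # requires p to be a power of two
        d = int(math.log2(p))
        return d, d, p * d // 2
    raise ValueError(topology)

for t in ("complete", "star", "ring", "hypercube"):
    print(t, params(t, 16))
# complete (15, 1, 120)  star (15, 2, 15)  ring (2, 8, 16)  hypercube (4, 4, 32)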

Static INs topologies: Completely Connected

Each processor is connected to
every other processor (number
of processors = p).
Diameter = 1; the number of links
in the network is p(p-1)/2
(e.g. 28 links for p = 8).
While performance scales very
well, the hardware complexity is not
realizable for large values of p.
In this sense, these networks
are static counterparts of
crossbars (dynamic INs).

(a) A completely-connected network of eight nodes;
(b) a star-connected network of nine nodes.

Static INs topologies: Star

Every node is connected
only to a common node at the
center.
The distance between any pair
of nodes is at most 2. However,
the central node becomes a
bottleneck.
In this sense, star networks
are static counterparts of
buses.
(a) A completely-connected network of eight nodes;
(b) a star-connected network of nine nodes.

Static INs topologies: Tree

A tree is a graph without any cycles.
A tree involves root, leaves, parents,
children, levels & height.
In a complete binary tree each internal
node has 3 neighbors (degree = 3): 1 parent
& 2 children (the root has no
parent & leaves have no children).
A p-node complete binary tree has
log2(p+1) levels, with 2^i nodes at level i
(the root level is 0).
A tree has link cost = p-1.
A tree has diameter = 2 (log2(p+1) - 1).
Trees can be laid out in 2D with no
wire crossings. This is an attractive
property of trees.
(These numbers are checked in the sketch below.)

(a) Complete binary tree IN

(b) Fat-tree IN
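A quick check of the complete-binary-tree numbers above by direct computation, for a tree with p = 2^k - 1 nodes:

import math

p = 31                          # a complete binary tree with 31 nodes (k = 5)
levels = int(math.log2(p + 1))  # log2(p+1) = 5 levels, root at level 0
links = p - 1                   # every non-root node has one link to its parent
diameter = 2 * (levels - 1)     # leaf -> root -> leaf in the opposite subtree

print(levels, links, diameter)  # -> 5 30 8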

Static INs topologies: Fat-tree

Links higher up the tree
potentially carry more traffic
than those at the lower levels.
For this reason, a variant
called a fat-tree fattens the
links as we go up the tree.

(a) Complete binary tree IN

(b) Fat-tree IN
