
Parallel Programming Concepts

OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.1: Hardware
Dr. Peter Tröger + Teaching Team

Summary: Week 4
2

Accelerators enable major speedup for data parallelism


SIMD execution model (no branching)
Memory latency managed with many light-weight threads
Tackle diversity with OpenCL
Loop parallelism with index ranges
Kernels in C, compiled at runtime
Complex memory hierarchy supported
Getting fast is easy, getting faster is hard
Best practices for accelerators
Hardware knowledge needed
What if my computational problem still demands more power?

Parallelism for
Speed: compute faster
Throughput: compute more in the same time
Scalability: compute faster / more with additional resources
Huge scalability only with shared nothing systems
Still also depends on application characteristics

[Figure: Scaling up adds processing elements (A1, A2, A3) that share one main memory; scaling out adds machines (B1, B2, B3), each with its own main memory.]

Parallel Hardware
Shared memory system
Typically a single machine, common address space for tasks
Hardware scaling is limited (power / memory wall)
Shared nothing (distributed memory) system
Tasks on multiple machines, can only access local memory
Global task coordination by explicit messaging
Easy scale-out by adding machines to the network
[Figure: In a shared memory system, tasks run on processing elements (each with a cache) that access one shared memory. In a shared nothing system, each processing element has its own local memory, and tasks coordinate by exchanging messages.]
Parallel Hardware
5

Shared memory system: a collection of processors
Integrated machine for capacity computing
Prepared for a large variety of problems
Shared nothing (distributed memory) system: a collection of computers
Clusters and supercomputers for capability computing
Installation to solve a few problems in the best way
Parallel software must be able to leverage multiple machines at the same time
Difference to distributed systems (Internet, Cloud):
Single organizational domain, managed as a whole
Single parallel application at a time, no separation of client and server application
Hybrids are possible (e.g. HPC in the Amazon AWS cloud)

Shared Nothing: Clusters


Collection of stand-alone machines connected by a local network
Cost-effective technique for a large-scale parallel computer
Users are builders, have control over their system
Synchronization much slower than in shared memory
Task granularity becomes an issue

[Figure: Cluster nodes, each with a processing element and local memory, run tasks that exchange messages over the network.]

Shared Nothing: Supercomputers


7

Supercomputers / Massively Parallel Processing (MPP) systems


(Hierarchical) cluster with a lot of processors
Still standard hardware, but specialized setup
High-performance interconnection network
For massive data-parallel applications, mostly simulations
(weapons, climate, earthquakes, airplanes, car crashes, ...)
Examples (Nov 2013):
BlueGene/Q, 1.5 million cores, 1.5 PB memory, 17.1 PFlops
Tianhe-2, 3.1 million cores, 1 PB memory, 17,808 kW power,
33.86 PFlops (quadrillions of calculations per second)
Bi-annual ranking with the TOP500 list (www.top500.org)

Example: IBM Blue Gene/Q (IBM System Technology Group)

1. Chip: 16+2 µP cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s

Sustained single node performance: 10x BG/P, 20x BG/L
MF/Watt: (6x) BG/P, (10x) BG/L (~2 GF/W, Green 500 criteria)
Software and hardware support for programming models for exploitation of node hardware concurrency
© 2011 IBM Corporation


Interconnection Networks
9

Bus systems
Static approach, low costs
Shared communication path, broadcasting of information
Scalability issues with the shared bus
Completely connected networks
Static approach, high costs
Only direct links, optimal performance
Star-connected networks
Static approach with central switch
Fewer links, still very good performance
Scalability depends on the central switch

[Figures: processing elements attached to a shared bus, a completely connected network of PEs, and PEs around a central switch.]

Interconnection Networks
10

Crossbar switch
Dynamic switch-based network
Supports multiple parallel direct connections without collisions
Fewer edges than a completely connected network, but still scalability issues
Fat tree
Uses wider links in the higher parts of the interconnect tree
Combines the tree design advantages with a solution for root node scalability
Communication distance between any two nodes is no more than 2 log(#PEs)
Fully connected graph: optimal connectivity, but high cost

[Figures: a crossbar switch connecting PE1..PEn, and a fat tree of switches with PEs at the leaves.]

Interconnection Networks
11

Linear array
Ring: a linear array with connected endings
N-way D-dimensional mesh
Matrix of processing elements
Not more than N neighbor links per element
Mesh and torus are structured in D dimensions
N-way D-dimensional torus
Compromise between cost and connectivity
Mesh with wrap-around connections

[Figures: linear array, ring, 4-way 2D mesh, 4-way 2D torus, 8-way 2D mesh.]

Example: Blue Gene/Q 5D Torus


5D torus interconnect in Blue Gene/Q supercomputer
2 GB/s on all 10 links, 80ns latency to direct neighbors
Additional link for
communication
with I/O nodes

[IBM]

12


Parallel Programming Concepts


OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.2: Granularity and Task Mapping
Dr. Peter Tröger + Teaching Team

Workload
14

Last week showed that task granularity may be flexible


Example: OpenCL work group size
But: Communication overhead becomes significant now
What is the right level of task granularity?

Surface-To-Volume Effect
15

Envision the work to be done (in parallel) as a sliced 3D cube
Not a demand on the application data, just a representation
Slicing represents splitting into tasks
Computational work of a task
Proportional to the volume of the cube slice
Represents the granularity of decomposition
Communication requirements of a task
Proportional to the surface of the cube slice
Determines the communication-to-computation ratio (sketched below)
Fine granularity: communication high, computation low
Coarse granularity: communication low, computation high
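A back-of-the-envelope sketch of the effect (not part of the original slide), assuming the work is an $n \times n \times n$ cube split into $p$ equal cubic blocks:

Computation per task: $n^3 / p$ cells.
Communication per task (surface): $\approx 6\,(n / p^{1/3})^2 = 6\,n^2 / p^{2/3}$ cells.
Ratio: $\dfrac{6\,n^2 / p^{2/3}}{n^3 / p} = \dfrac{6\,p^{1/3}}{n}$

The ratio grows with finer granularity (larger $p$) and shrinks for larger problems (larger $n$).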

Surface-To-Volume Effect
Fine-grained decomposition for using all processing elements?
Coarse-grained decomposition to reduce communication overhead?
A tradeoff question!

[nicerweb.com]

16


Surface-To-Volume Effect
Heatmap example with
64 data cells
Version (a): 64 tasks
64 x 4 = 256 messages, 256 data values
64 processing
elements used in
parallel
Version (b): 4 tasks
16 messages,
64 data values
4 processing elements
used in parallel

[Foster]

17

Surface-To-Volume Effect
Rule of thumb
Agglomerate tasks to avoid communication
Stop when parallelism is no longer exploited well enough
Agglomerate in all dimensions at the same time
Influencing factors
Communication technology + topology
Serial performance per processing element
Degree of application parallelism
Task communication vs. network topology
Resulting task graph must be
mapped to network topology
Task-to-task communication
may need multiple hops

[Foster]

18

The Task Mapping Problem


19

Given
a number of homogeneous processing elements
with performance characteristics,
some interconnection topology of the processing elements
with performance characteristics,
an application dividable into parallel tasks.
Questions:
What is the optimal task granularity?
How should the tasks be placed on processing elements?
Do we still get speedup / scale-up by this parallelization?
Task mapping is still research, mostly manual tuning today
More options with configurable networks / dynamic routing
Reconfiguration of hardware communication paths

Parallel Programming Concepts


OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.3: Programming with MPI
Dr. Peter Tröger + Teaching Team

Message Passing
21

Parallel programming paradigm for shared nothing environments


Implementations for shared memory available,
but typically not the best approach
Users submit their message passing program & data as job
Cluster management system creates program instances
[Figure: A job submission host hands the application to the cluster management software, which starts application instances 0..3 on the execution hosts.]

Single Program Multiple Data (SPMD)
22

Every instance executes the same SPMD program; behavior differs only by rank and by the input data.

Instance 0 ... Instance 4 each run:

// (determine rank and comm_size)
int token;
if (rank != 0) {
  // Receive from your left neighbor if you are not rank 0
  MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  printf("Process %d received token %d from process %d\n",
         rank, token, rank - 1);
} else {
  // Set the token's value if you are rank 0
  token = -1;
}
// Send your local token value to your right neighbor
MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size,
         0, MPI_COMM_WORLD);
// Now rank 0 can receive from the last rank.
if (rank == 0) {
  MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  printf("Process %d received token %d from process %d\n",
         rank, token, comm_size - 1);
}
Message Passing Interface (MPI)


23

Many optimized messaging libraries for shared nothing environments, developed by networking hardware vendors
Need for standardized API solution: Message Passing Interface
Definition of API syntax and semantics
Enables source code portability, not interoperability
Software independent from hardware concepts
Fixed number of process instances, defined on startup
Point-to-point and collective communication
Focus on efficiency of communication and memory usage
MPI Forum standard
Consortium of industry and academia
MPI 1.0 (1994), 2.0 (1997), 3.0 (2012)


MPI Communicators
Each application instance (process) has a rank, starting at zero
Communicator: Handle for a group of processes
Unique rank numbers inside the communicator group
Instance can determine communicator size and own rank
Default communicator MPI_COMM_WORLD
Instance may be in multiple communicator groups
[Figure: A communicator group of four processes; each process has a unique rank 0..3 and sees size 4.]

24
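A minimal sketch of how an instance determines its own rank and the communicator size (standard MPI calls; error handling omitted):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, comm_size;
    MPI_Init(&argc, &argv);                     // start the MPI runtime
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       // own rank within the default communicator
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);  // number of instances in the group
    printf("Instance %d of %d\n", rank, comm_size);
    MPI_Finalize();                             // shut down the MPI runtime
    return 0;
}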

Communication
25

Point-to-point communication between instances


int MPI_Send(void* buf, int count, MPI_Datatype type,
int destRank, int tag, MPI_Comm com);
int MPI_Recv(void* buf, int count, MPI_Datatype type,
int sourceRank, int tag, MPI_Comm com);
Parameters
Send / receive buffer + size + data type
Sender provides receiver rank, receiver provides sender rank
Arbitrary message tag
Source / destination identified by [tag, rank, communicator] tuple
Default send / receive will block until the match occurs
Useful constants for receiving: MPI_ANY_TAG, MPI_ANY_SOURCE (see the sketch below)
Variations in the API for different buffering behavior
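A small sketch of a wildcard receive; the actual sender and tag can be read from the status object afterwards (includes omitted):

MPI_Status status;
int payload;
// Accept one message from any sender with any tag
MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
printf("Got %d from rank %d (tag %d)\n",
       payload, status.MPI_SOURCE, status.MPI_TAG);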

Example: Ring communication


// (determine rank and comm_size)
int token;
if (rank != 0) {
// Receive from your left neighbor if you are not rank 0
MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("Process %d received token %d from process %d\n",
rank, token, rank - 1);
} else {
// Set the token's value if you are rank 0
token = -1;
}
// Send your local token value to your right neighbor
MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size,
0, MPI_COMM_WORLD);
// Now rank 0 can receive from the last rank.
if (rank == 0) {
MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("Process %d received token %d from process %d\n",
rank, token, comm_size - 1);
}

[mpitutorial.com]

26

Deadlocks
27

Consider:

int MPI_Send(void* buf, int count, MPI_Datatype type,


int destRank, int tag, MPI_Comm com);

int a[10], b[10], myrank;


MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...
If MPI_Send is blocking, there is a deadlock.
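One possible fix, as a sketch: let rank 1 receive the messages in the order in which rank 0 sends them (alternatively, non-blocking sends could be used), so neither side waits on the other:

if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
} else if (myrank == 1) {
    // Receive in the same order as the messages are sent (tag 1 first)
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
}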

Collective Communication
28

Point-to-point communication vs. collective communication


Use cases: Synchronization, data distribution & gathering
All processes in a (communicator) group communicate together
One sender with multiple receivers (one-to-all)
Multiple senders with one receiver (all-to-one)
Multiple senders and multiple receivers (all-to-all)
Typical pattern in supercomputer applications
Participants continue only when the group communication is completed
Always blocking operation
Must be executed by all processes in the group
No assumptions on the state of other participants on return


Barrier
29

Communicator members block until everybody reaches the barrier


MPI_Barrier(comm)

[Figure: Several instances of the ring example call MPI_Barrier(comm); each continues only after all members of the communicator have reached the barrier.]
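A minimal usage sketch, assuming each rank finishes some local compute_phase() before a synchronized timing step (both names are placeholders):

compute_phase();                 // local work of this rank
MPI_Barrier(MPI_COMM_WORLD);     // wait until every rank has finished the phase
double t0 = MPI_Wtime();         // e.g. start a timing section aligned across ranks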

Broadcast
int MPI_Bcast( void *buffer, int count,
MPI_Datatype datatype, int rootRank, MPI_Comm comm )
rootRank is the rank of the chosen root process
Root process broadcasts data in buffer to all other processes,
itself included
On return, all processes have the same data in their buffer
[Figure: Before the broadcast, only the root holds D0 in its buffer; after the broadcast, every process holds D0.]

30
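A short usage sketch; the params array is a made-up example, rank is determined as shown earlier, and every rank executes the same MPI_Bcast line:

int params[4];                    // hypothetical configuration values
if (rank == 0) {
    params[0] = 42; params[1] = 7;
    params[2] = 1;  params[3] = 0;   // only the root fills the buffer
}
MPI_Bcast(params, 4, MPI_INT, 0, MPI_COMM_WORLD);
// After the call, all ranks hold the same four values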


Scatter
int MPI_Scatter(void *sendbuf, int sendcnt,
MPI_Datatype sendtype, void *recvbuf, int recvcnt,
MPI_Datatype recvtype, int rootRank, MPI_Comm comm)
sendbuf buffer on root process is divided, parts are sent
to all processes, including root
MPI_SCATTERV allows varying count of data per rank

[Figure: Scatter divides D0..D5 on the root and sends one element to each process; gather is the inverse operation.]

31


Gather
int MPI_Gather(void *sendbuf, int sendcnt,
MPI_Datatype sendtype, void *recvbuf, int recvcnt,
MPI_Datatype recvtype, int rootRank, MPI_Comm comm)
Each process (including the root process) sends the data in its
sendbuf buffer to the root process
Incoming data in recvbuf is stored in rank order
recvbuf parameter is ignored for all non-root processes
[Figure: Each process, including the root, contributes one element; the root collects D0..D5 in rank order.]

32
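A sketch of a scatter/gather round trip, assuming rank and comm_size are known; includes and error handling omitted, and the doubling is just placeholder work:

int item, *data = NULL, *result = NULL;
if (rank == 0) {                    // only the root owns the full arrays
    data   = malloc(comm_size * sizeof(int));
    result = malloc(comm_size * sizeof(int));
    for (int i = 0; i < comm_size; i++) data[i] = i;
}
MPI_Scatter(data, 1, MPI_INT, &item, 1, MPI_INT, 0, MPI_COMM_WORLD);
item *= 2;                          // local work on the received element
MPI_Gather(&item, 1, MPI_INT, result, 1, MPI_INT, 0, MPI_COMM_WORLD);
// result on rank 0 now holds the doubled elements in rank order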


Reduction
int MPI_Reduce(void *sendbuf, void *recvbuf,
int count, MPI_Datatype datatype, MPI_Op op,
int rootRank, MPI_Comm comm)
Similar to MPI_Gather
Additional reduction operation op to aggregate the received
data: maximum, minimum, sum, product, boolean
operators, max-location, min-location
MPI implementation can overlap communication and
reduction calculation for faster results
[Figure: Reduce with op = + combines D0A, D0B, and D0C from three processes into D0A + D0B + D0C at the root.]

33

Example: MPI_Scatter + MPI_Reduce


34
/* -- E. van den Berg 07/10/2001 -- */
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[]) {
    int data[] = {1, 2, 3, 4, 5, 6, 7}; // Size must be >= #processors
    int rank, i = -1, j = -1;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    MPI_Scatter ((void *)data, 1, MPI_INT, (void *)&i,
                 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf ("[%d] Received i = %d\n", rank, i);

    MPI_Reduce ((void *)&i, (void *)&j, 1, MPI_INT, MPI_PROD,
                0, MPI_COMM_WORLD);

    printf ("[%d] j = %d\n", rank, j);
    MPI_Finalize();
    return 0;
}
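With a typical MPI installation, this example would be built and started roughly like this (the file name is hypothetical; mpiexec works as well):

$ mpicc -o scatter_reduce scatter_reduce.c
$ mpirun -np 4 ./scatter_reduce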

What Else
35

Variations:
MPI_Isend, MPI_Sendrecv, MPI_Allgather, MPI_Alltoall, ... (non-blocking send sketched below)
Definition of virtual topologies for better task mapping
Complex data types
Packing / Unpacking (sprintf / sscanf)
Group / Communicator Management
Error Handling
Profiling Interface
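A minimal sketch of the non-blocking send mentioned above; dest and do_other_work() are placeholders:

MPI_Request req;
int value = 42;
MPI_Isend(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD, &req); // returns immediately
do_other_work();                     // overlap computation with communication
MPI_Wait(&req, MPI_STATUS_IGNORE);   // the send buffer may be reused only after completion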
Several implementations available
MPICH - Argonne National Laboratory
OpenMPI - Consortium of Universities and Industry
...

Parallel Programming Concepts


OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.4: Programming with Channels
Dr. Peter Tröger + Teaching Team

Communicating Sequential Processes


37

Formal process algebra to describe concurrent systems


Developed by Tony Hoare at University of Oxford (1977)
Also inventor of QuickSort and Hoare logic
Computer systems act and interact with the environment
Decomposition in subsystems (processes) that operate
concurrently inside the system
Processes interact with other processes, or the environment
Book: T. Hoare, Communicating Sequential Processes, 1985
A mathematical theory, described with algebraic laws
CSP channel concept available in many programming
languages for shared nothing systems
Complete approach implemented in Occam language


CSP: Processes
38

Behavior of real-world objects can be described through their


interaction with other objects
Leave out internal implementation details
Interface of a process is described as set of atomic events
Example: ATM and User, both modeled as processes
card event: insertion of a credit card in the ATM card slot
money event: extraction of money from the ATM dispenser
Alphabet: the set of relevant events for an object description
Events may never happen; interaction is restricted to these events
αATM = αUser = {card, money}
A CSP process is the behavior of an object, described with its
alphabet


Communication in CSP
39

Special class of event: Communication


Modeled as unidirectional channel between processes
Channel name is a member of the alphabets of both processes
Send activity described by multiple c.v events
Channel approach assumes rendezvous behavior
Sender and receiver block on the channel operation until the
message is transmitted
Implicit barrier based on communication
With formal foundation, mathematical proofs are possible
When two concurrent processes communicate with each other
only over a single channel, they cannot deadlock.
A network of non-stopping processes that is free of cycles cannot deadlock.


What's the Deal?


40

Any possible system can be modeled through event chains


Enables mathematical proofs for deadlock freedom,
based on the basic assumptions of the formalism
(e.g. single channel assumption)
Some tools available (check readings page)
CSP was the formal base for the Occam language
Language constructs follow the formalism
Mathematical reasoning about the behavior of written code
Still active research (Welsh University), channel concept
frequently adopted
CSP channel implementations for Java, MPI, Go, C, Python
Other formalisms based on CSP, e.g. Task/Channel model


Channels in Scala
41

Scope-based channel sharing:

actor {
  var out: OutputChannel[String] = null
  val child = actor {
    react {
      case "go" => out ! "hello"
    }
  }
  val channel = new Channel[String]
  out = channel
  child ! "go"
  channel.receive {
    case msg => println(msg.length)
  }
}

Sending channels in messages:

case class ReplyTo(out: OutputChannel[String])

val child = actor {
  react {
    case ReplyTo(out) => out ! "hello"
  }
}

actor {
  val channel = new Channel[String]
  child ! ReplyTo(channel)
  channel.receive {
    case msg => println(msg.length)
  }
}

Channels in Go
42

package main

import "fmt"

// Concurrent sayHello function: puts a value into channel ch1
func sayHello(ch1 chan string) {
    ch1 <- "Hello World\n"
}

func main() {
    ch1 := make(chan string) // program start, create the channel
    go sayHello(ch1)         // run sayHello concurrently
    fmt.Printf(<-ch1)        // read the value from ch1 and print it
}

$ 8g chanHello.go                (compile application)
$ 8l -o chanHello chanHello.8    (link application)
$ ./chanHello                    (run application)
Hello World
$

Channels in Go
43

The select concept allows switching between available channels


All channels are evaluated
If multiple can proceed, one is chosen randomly
Default clause if no channel is available
select {
case v := <-ch1:
fmt.Println("channel 1 sends", v)
case v := <-ch2:
fmt.Println("channel 2 sends", v)
default: // optional
fmt.Println("neither channel was ready")
}
Channels are typically first-class language constructs
Example: Client provides a response channel in the request
Popular solution to get deterministic behavior

Task/Channel Model
44

Computational model for multi-computers by Ian Foster


Similar concepts to CSP
Parallel computation consists of one or more tasks
Tasks execute concurrently
Number of tasks can vary during execution
Task: Serial program with local memory
A task has in-ports and outports as interface to the
environment
Basic actions: Read / write local memory,
send message on outport,
receive message on in-port,
create new task, terminate


Task/Channel Model
45

Outport / in-port pairs are connected by channels


Channels can be created and deleted
Channels can be referenced as ports,
which can be part of a message
Send operation is non-blocking
Receive operation is blocking
Messages in a channel stay in order
Tasks are mapped to physical processors by the execution
environment
Multiple tasks can be mapped to one processor
Data locality is explicit part of the model
Channels can model control and data dependencies


Programming With Channels


46

Channel-only parallel programs have advantages


Performance optimization does not influence semantics
Example: Shared-memory channels for some parts
Task mapping does not influence semantics
Align number of tasks for the problem,
not for the execution environment
Improves scalability of implementation
Modular design with well-defined interfaces
Communication should be balanced between tasks
Each task should only communicate with a small group of
neighbors


Parallel Programming Concepts


OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.5: Programming with Actors
Dr. Peter Tröger + Teaching Team

Actor Model
48

Carl Hewitt, Peter Bishop and Richard Steiger: "A Universal Modular Actor Formalism for Artificial Intelligence", IJCAI 1973.
Mathematical model for concurrent computation
Actor as computational primitive
Local decisions, concurrently sends / receives messages
Has a mailbox for incoming messages
Concurrently creates more actors
Asynchronous one-way message sending
Changing topology allowed, typically no order guarantees
Recipient is identified by mailing address
Actors can send their own identity to other actors
Available as programming language extension or library
in many environments

Erlang (Ericsson Language)


49

Functional language with actor support


Designed for large-scale concurrency
First version in 1986 by Joe Armstrong, Ericsson Labs
Available as open source since 1998
Language goals driven by Ericsson product development
Scalable distributed execution of phone call handling software
with large number of concurrent activities
Fault-tolerant operation under timing constraints
Online software update
Users
Amazon EC2 SimpleDB, Delicious, Facebook chat, T-Mobile SMS and authentication, Motorola call processing, Ericsson GPRS and 3G mobile network products, CouchDB, ejabberd, ...

Concurrency in Erlang
50

Concurrency Oriented Programming


Actor processes are completely independent (shared nothing)
Synchronization and data exchange with message passing
Each actor process has an unforgeable name
If you know the name, you can send a message
Default approach is fire-and-forget
You can monitor remote actor processes
Using this gives you
Opportunity for massive parallelism
No additional penalty for distribution, despite latency issues
Easier fault tolerance capabilities
Concurrency by default

Actors in Erlang
51

Communication via message passing is part of the language


Send never fails, works asynchronously (PID ! Message)
Actors have mailbox functionality
Queue of received messages, selective fetching
Only messages from same source arrive in-order
receive statement with set of clauses, pattern matching
Process is suspended in receive operation until a match
receive
Pattern1 when Guard1 -> expr1, expr2, ..., expr_n;
Pattern2 when Guard2 -> expr1, expr2, ..., expr_n;
Other -> expr1, expr2, ..., expr_n
end


Erlang Example: Ping Pong Actors


52

-module(tut15).
-export([test/0, ping/2, pong/0]).     % functions exported + number of arguments

ping(0, Pong_PID) ->
    Pong_PID ! finished,
    io:format("Ping finished~n", []);
ping(N, Pong_PID) ->                   % Ping actor, sending a message to Pong
    Pong_PID ! {ping, self()},
    receive                            % blocking recursive receive, scanning the mailbox
        pong ->
            io:format("Ping received pong~n", [])
    end,
    ping(N - 1, Pong_PID).

pong() ->                              % Pong actor
    receive                            % blocking recursive receive, scanning the mailbox
        finished ->
            io:format("Pong finished~n", []);
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            Ping_PID ! pong,           % sending a message back to Ping
            pong()
    end.

test() ->                              % start the Ping and Pong actors
    Pong_PID = spawn(tut15, pong, []),
    spawn(tut15, ping, [3, Pong_PID]).

[erlang.org]

Actors in Scala
53

Actor-based concurrency in Scala, similar to Erlang


Concurrency abstraction on top of threads or processes
Communication by non-blocking send operation and blocking
receive operation with matching functionality
actor {
var sum = 0
loop {
receive {
case Data(bytes)
=> sum += hash(bytes)
case GetSum(requester) => requester ! sum
}}}
All constructs are library functions (actor, loop, receive, ...)
Alternative self.receiveWithin() call with timeout
Case classes act as message type representation

Scala Example: Counter Actor


54

import scala.actors.Actor
import scala.actors.Actor._
case class Inc(amount: Int)
case class Value
class Counter extends Actor {
var counter: Int = 0;
def act() = {
while (true) {
receive {
case Inc(amount) =>
counter += amount
case Value =>
println("Value is "+counter)
exit()
}}}}
object ActorTest extends Application {
val counter = new Counter
counter.start()
for (i <- 0 until 100000) {
counter ! Inc(1)
}
counter ! Value
// Output: Value is 100000
}

Case classes,
acting as message types
Implementation of
the counter actor
Blocking receive loop,
scanning the mailbox

Start the counter actor


Send an Inc message
to the counter actor
Send a Value message
to the counter actor

Actor Deadlocks
Synchronous send operator !? available in Scala
Sends a message and blocks in receive afterwards
Intended for request-response pattern
// actorA
actorB !? Msg1(value) match {
case Response1(r) =>
// ...
}
receive {
case Msg2(value) =>
reply(Response2(value))
}

// actorB
actorA !? Msg2(value) match {
case Response2(r) =>
// ...
}
receive {
case Msg1(value) =>
reply(Response1(value))
}

Original asynchronous send makes deadlocks less probable


// actorA
actorB ! Msg1(value)
while (true) {
receive {
case Msg2(value) =>
reply(Response2(value))
case Response1(r) =>
// ...
}}

// actorB
actorA ! Msg2(value)
while (true) {
receive {
case Msg1(value) =>
reply(Response1(value))
case Response2(r) =>
// ...
}}

[http://savanne.be/articles/concurrency-in-erlang-scala/]

55

Parallel Programming Concepts


OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.6: Programming with MapReduce
Dr. Peter Tröger + Teaching Team

MapReduce
57

Programming model for parallel processing of large data sets


Inspired by map() and reduce() in functional programming
Intended for best scalability in data parallelism
Huge interest started with Google Research publication
Jeffrey Dean and Sanjay Ghemawat.
MapReduce: Simplified Data Processing on Large Clusters
Google products rely on internal implementation
Apache Hadoop: Widely known open source implementation
Scales to thousands of nodes
Has been shown to process petabytes of data
Cluster infrastructure with custom file system (HDFS)
Parallel programming on very high abstraction level

MapReduce Concept
58

Map step
Convert input tuples [key, value] with a map() function into one or multiple intermediate tuples [key2, value2] per input
Shuffle step: Collect all intermediate tuples with the same key
Reduce step
Combine all intermediate tuples with the same key by some reduce() function into one result per key
Developer just defines stateless map() and reduce() functions
Framework automatically ensures parallelization
Persistence layer needed for input and output only

[developers.google.com]

Example: Character Counting


59


Java Example: Hadoop Word Count


public class WordCount {
public static class Map extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}}}
public static class Reduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}}...}

[hadoop.apache.org]

60

MapReduce Data Flow

[developer.yahoo.com]

61


Advantages
62

Developer never implements communication or synchronization,


implicitly done by the framework
Allows transparent fault tolerance and optimization
Running map and reduce tasks are stateless
Only rely on their input, produce their own output
Repeated execution in case of failing nodes
Redundant execution for compensating nodes with different
performance characteristics
Scale-out only limited by
Distributed file system performance (input / output data)
Shuffle step communication performance
Chaining of map/reduce tasks is very common in practice
But: Demands an embarrassingly parallel problem

Summary: Week 5
63

Shared nothing systems provide very good scalability


Adding new processing elements not limited by walls
Different options for interconnect technology
Task granularity is essential
Surface-to-volume effect
Task mapping problem
De-facto standard is MPI programming
High level abstractions with
Channels
Actors
MapReduce
What steps / strategy would you apply
to parallelize a given compute-intense program?
