
Abhiram G. Ranade
Dept of CSE, IIT Bombay
¨ Availability of very powerful parallel computers, e.g.
CDAC PARAM Yuva
¨ Need to solve large problems
¨ Multicore desktop machines
¨ Inexpensive GPUs
¨ FPGA coprocessors
¨ Network of processors.
¡ Local computation: one operation/step/processor
¡ Communication with d neighbours: b words / L steps / processor
¨ Shared Memory
¡ Local computation: one operation/step/processor
¡ Access to shared memory: b words / L steps / processor
¨ “Fine grain”: small L, large b. Otherwise “coarse grain”
¨ Separate program/processor. (possible: same
program)
¨ Synchronous vs. Asynchronous execution of
different processors
¨ Memory hierarchies. Interaction of cache policies.
Dramatic difference in memory access times for
hits vs. misses.
¨ Algorithm design is tricky!
Maximize Speedup = T1 / Tp
T1 = Time using best sequential algorithm
Tp = Time using parallel algorithm on p
processors.

Ideally speedup = p. Usually not possible.
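Example (illustrative numbers, not from the slides): T1 = 100 s and Tp = 20 s on p = 8 processors gives speedup 100/20 = 5, i.e. efficiency speedup/p = 62.5%.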


¨ General strategy for designing parallel algorithms
¨ Brief Case Studies
¨ Summary of main themes
Not necessarily in order:
¨ Design sequential algorithm
¡ Make sure you know how to solve the problem!
¨ Identify Parallelism
¡ Sometimes obvious
¨ Assign available work to processors
¡ Balance load
¨ Minimize communication
¨ Matrix multiplication
¨ Prefix
¨ Sorting
¨ Sparse matrix multiplication
¨ N-body problems
¨ Parallel Search
For i = 1..N
  For j = 1..N
    C[i,j] = 0
    For k = 1..N
      C[i,j] += A[i,k] * B[k,j]

Each C[i,j] can be computed in parallel
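A minimal shared-memory sketch of this idea in Python (illustrative, not from the slides): each row of C is an independent task, farmed out here with multiprocessing. Shipping all of B to every worker is wasteful; the point is only to show the available parallelism.

# Sketch: each row of C is independent, so rows can be computed in parallel.
from multiprocessing import Pool

def row_times_matrix(args):
    a_row, B = args                      # one row of A, all of B
    ncols = len(B[0])
    return [sum(a_row[k] * B[k][j] for k in range(len(B)))
            for j in range(ncols)]

def parallel_matmul(A, B, nproc=4):
    with Pool(nproc) as pool:            # one task per row of C
        return pool.map(row_times_matrix, [(row, B) for row in A])

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(parallel_matmul(A, B))         # [[19, 22], [43, 50]]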


[Figure: a systolic 3x3 grid of processors (O--O--O in each row). The entries of A stream in along the rows and the entries of B down the columns, each skewed by one step so that matching pairs meet; the cell in row i, column j accumulates C[i,j], e.g. the corner cell computes C[3,3].]
¨ Implementation needs fine granularity
¨ If your computer has coarse granularity: treat each
ij as a q x q submatrix.
¡ Amount of data input / “step”: 2 x q x q
¡ Amount of computation / “step”: q^3, so the compute-to-communication ratio grows as q/2
¨ If your network has another topology, embed the grid
network into your topology.
¡ Map grid vertices to your processors
¡ Map grid edges to paths in your network
¡ Your network simulates grid
¨ Algorithm is self synchronizing. Processors wait for
data
¨ Data can be on disks in the network; how to distribute
it is an important question.
Input: A[1..N], Output: B[1..N]
B[1] = A[1]
For j = 2 to N
  B[j] = B[j-1] + A[j]   // + : any associative op

¨ Model of recurrences and carry lookahead; used for matching
work to processors; an algorithmic primitive in sorting and
N-body computation
¨ Not parallel? The jth iteration needs the result of the (j-1)th
¨ Construct C[1..N/2], C[j] = A[2j-1] + A[2j]
¨ Recursively solve. Answer D[1..N/2]
¨ D[j] = C[1] + C[2] + … + C[j]
       = A[1] + A[2] + … + A[2j] = B[2j]
¨ B[2j-1] = D[j] - A[2j]
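A minimal Python sketch of this recursion (0-indexed, illustrative names; a real parallel version would run the pairing loop and the fix-up loop concurrently across processors):

def prefix(A, op=lambda x, y: x + y):
    # Returns B with B[j] = A[0] op A[1] op ... op A[j] (0-indexed).
    # The pairing loop and the fix-up loop are each fully parallel;
    # only the recursion is sequential, giving depth O(log N).
    n = len(A)
    if n == 1:
        return [A[0]]
    # Pair up adjacent elements: C[j] = A[2j] op A[2j+1]
    C = [op(A[2*j], A[2*j + 1]) for j in range(n // 2)]
    D = prefix(C, op)                    # recurse on half-size input
    B = [None] * n
    for j in range(n // 2):              # fix-ups, also parallel
        B[2*j + 1] = D[j]                # even-position prefixes
        B[2*j] = op(D[j - 1], A[2*j]) if j > 0 else A[0]
    if n % 2:                            # odd length: last element
        B[-1] = op(D[-1], A[-1])
    return B

print(prefix([1, 2, 3, 4, 5]))           # [1, 3, 6, 10, 15]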

¨ Tree implementation: A fed at the leaves, C computed at the
parents of the leaves, B arrives back at the leaves.
¨ Fine grained algorithm
¨ For use on coarse-grained networks, embed big
subtrees on one processor.
¨ Not necessary to have complete binary trees
¨ No need for a “-” operation: compute B[2j-1] = D[j-1] + A[2j-1] instead.
¨ Algorithm can wait for data, self synchronizing.
¨ As much work and as many clever algorithms as in
sequential sorting.
¨ Merge paradigm: each processor sorts its data
locally. Then all sublists are merged. Merging can
happen in parallel.
Bucket sort paradigm:
¨ Assign one bucket to each processor.
¡ Use sampling to determine ranges.
¨ Each processor sends its keys to correct bucket.
¨ Each processor locally sorts the keys it receives.
¨ Manage communication properly: sorting is
communication intensive.
¨ Pack many keys into a single message; do not send one
message per key.
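A minimal single-machine sketch of the bucket-sort paradigm in Python (illustrative; splitter selection by sampling is simplified, and in a real implementation each bucket would be sorted by a different processor, with keys exchanged in batched messages):

import random

def sample_sort(keys, nbuckets=4, oversample=8):
    # 1. Sample keys to pick bucket boundaries ("splitters").
    sample = sorted(random.sample(keys, min(len(keys), nbuckets * oversample)))
    splitters = [sample[i * len(sample) // nbuckets] for i in range(1, nbuckets)]
    # 2. Route every key to its bucket (one bucket per processor).
    buckets = [[] for _ in range(nbuckets)]
    for k in keys:
        b = sum(k >= s for s in splitters)   # index of the range containing k
        buckets[b].append(k)
    # 3. Each "processor" sorts its bucket locally; concatenation is sorted.
    return [k for b in buckets for k in sorted(b)]

data = [random.randrange(1000) for _ in range(100)]
assert sample_sort(data) == sorted(data)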
¨ Sorting is an important primitive; pay great
attention to the communication network.
¨ Similar issues in database join operations.
¨ Key operation in many numerical codes, e.g. finite
element method.
¨ Invoked repeatedly, e.g. in solving linear systems,
differential equations.
¨ Dense matrix vector multiplication is easy -- similar
to matrix-matrix multiplication.
¨ Graph: derived from the problem, e.g. finite element
mesh graph
¨ Vector: V[j] present at vertex j of some graph
¨ Matrix: A[j,k] present on edge (j,k) of the graph
¨ Multiplication:
¡ Each vertex sends its data on all edges
¡ As the data moves along an edge, it is multiplied by the
coefficient
¡ The products are added as they arrive at the vertices.
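A minimal Python sketch of this edge-based view (illustrative names; the per-edge loop stands in for the messages, and in a real code each processor would own a subset of vertices and edges, as discussed next):

def graph_matvec(edges, V):
    # edges: dict mapping (j, k) -> A[j,k]; V: dict vertex -> value.
    # Simulates: vertex k sends V[k] along edge (j, k); the value is
    # multiplied by the edge coefficient and accumulated at vertex j.
    y = {j: 0.0 for j in V}
    for (j, k), coeff in edges.items():
        y[j] += coeff * V[k]          # message V[k] arrives at j, scaled
    return y

# 2x2 example: A = [[2, 1], [0, 3]], V = [4, 5]  ->  y = [13, 15]
edges = {(0, 0): 2.0, (0, 1): 1.0, (1, 1): 3.0}
print(graph_matvec(edges, {0: 4.0, 1: 5.0}))    # {0: 13.0, 1: 15.0}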
¨ Partition the graph among available processors s.t.
¡ Each processor gets equal no of edges (load balance)
¡ Most edges have both endpoints on the same processor
(minimize communication)
¨ Classic, hard problem
¨ Many heuristics known.
¡ Spectral Methods
¡ Multi-level methods based on graph coarsening, e.g. METIS
¨ Good partitioning is possible for “well shaped
meshes” arising out of finite element methods.
Input: positions of n stars
Output: force on each star due to all others.
¨ Naïve Algorithm: O(n^2); for each star, calculate the
force due to every other (sketched after the next bullet).
¨ Fast Multipole Method: O(n). Based on clustering
stars and considering cluster-cluster interactions +
star-cluster + star-star interactions.
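A minimal Python sketch of the naive method (illustrative: unit masses, 2-D positions, and a small softening term eps added to avoid division by zero). Each star's force sum is independent of the others, so the outer loop parallelizes trivially.

import math

def naive_forces(pos, G=1.0, eps=1e-9):
    # O(n^2): for each star, sum the force due to every other star.
    # pos: list of (x, y) tuples; unit masses; returns (fx, fy) per star.
    n = len(pos)
    forces = []
    for i in range(n):
        fx = fy = 0.0
        for j in range(n):
            if i == j:
                continue
            dx, dy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
            r = math.sqrt(dx*dx + dy*dy) + eps
            f = G / (r * r)               # force magnitude, unit masses
            fx += f * dx / r              # direction: towards star j
            fy += f * dy / r
        forces.append((fx, fy))
    return forces

print(naive_forces([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]))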
¨ Star data resides at leaves of an octree.
¨ Cluster data resides at internal nodes
¡ Computed by passing data along tree edges.
¨ Cluster-cluster interaction: data flows between
neighbours at same level of tree.
¨ Dataflow is “pyramidal”.
¡ Possible to nicely embed pyramids in most networks.
¡ More complexity because the pyramid might have a variable
number of levels if stars are distributed unevenly.
¨ “Find a combinatorial object satisfying properties
…”, e.g. packing, TSP, …
¨ Naturally parallel, but unstructured growth of the
search tree.
¨ Difficulty in load balancing.
¨ Key questions: can we maintain a distributed work queue?
How do we share the best solution found so far, and the bounds?
¨ Prefix like computations useful in maintaining
distributed queue.
¨ Randomness also useful. Each processor asks for
work from a random processor when its work
queue runs out (sketched below).
¨ Broadcasting required to maintain the best solution and
bounds.
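A minimal shared-memory sketch of this randomized strategy using Python threads (illustrative; a distributed version would steal via messages, and termination detection is simplified here to a global counter of completed tasks):

import queue
import random
import threading

NPROC, NTASKS = 4, 100
queues = [queue.Queue() for _ in range(NPROC)]
done = threading.Event()
completed = [0]
lock = threading.Lock()

def worker(me):
    while not done.is_set():
        try:                                   # try own queue first
            task = queues[me].get_nowait()
        except queue.Empty:                    # out of work: steal from
            victim = random.randrange(NPROC)   # a random processor
            try:
                task = queues[victim].get_nowait()
            except queue.Empty:
                continue                       # busy-wait; fine for a sketch
        task()                                 # execute the unit of work
        with lock:
            completed[0] += 1
            if completed[0] == NTASKS:
                done.set()

# Unbalanced initial load: all tasks start on processor 0.
for i in range(NTASKS):
    queues[0].put(lambda i=i: sum(range(1000)))

threads = [threading.Thread(target=worker, args=(p,)) for p in range(NPROC)]
for t in threads: t.start()
for t in threads: t.join()
print("all", NTASKS, "tasks completed")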
¨ Graph embedding
¡ Develop algorithm for one network, embed that network
on what you actually have
¡ Form graph of dataflow / data dependence, embed that in
the network you have
¨ Solve problem locally, merge global results
¡ Sorting
¡ Matrix multiplication: locally implement subblock
multiplication
¡ Useful in graph algorithms
¨ Randomization
¡ Useful in load balancing
¡ Also used in communication, select paths randomly
¡ Symmetry breaking
¡ Sampling
¨ Co-ordination
¡ Prefix is useful
¡ Distributed data structures, e.g. queues, hash tables
¨ Communication Patterns
¡ All-to-all: e.g. sorting
¡ Permutation
¡ Broadcasting
¨ Vast subject
¨ Many clever algorithms
¨ Many models of computation
¨ Many issues besides algorithms: how to express
parallelism in a programming language, …
¨ Quick tour of some of the important ideas.
