
Abhiram G. Ranade
Dept of CSE, IIT Bombay
¨ Availability of very powerful parallel computers, e.g.
CDAC PARAM Yuva
¨ Need to solve large problems
¨ Multicore desktop machines
¨ Inexpensive GPUs
¨ FPGA coprocessors
¨ Network of processors.
¡ Local computation: one operation/step/processor
¡ Communication with d neighbours: b words / L steps / processor
¨ Shared Memory
¡ Local computation: one operation/step/processor
¡ Access to shared memory: b words / L steps / processor
¨ “Fine grain”: small L, large b. Otherwise “coarse grain”
¨ Separate program/processor. (possible: same
program)
¨ Synchronous vs. Asynchronous execution of
different processors
¨ Memory hierarchies. Interaction of cache policies.
Dramatic difference in memory access times for
hits vs. misses.
¨ Algorithm design is tricky!
Maximize Speedup = T1 / Tp
T1 = Time using best sequential algorithm
Tp = Time using parallel algorithm on p
processors.

Ideally speedup = p. Usually not possible.
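Example (illustrative numbers, not from the slides): T1 = 100 s and Tp = 20 s on p = 8 processors gives speedup 100/20 = 5, i.e. efficiency speedup/p = 62.5%.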


¨ General strategy for designing parallel algorithms
¨ Brief Case Studies
¨ Summary of main themes
Not necessarily in order:
¨ Design sequential algorithm
¡ Make sure you know how to solve the problem!
¨ Identify Parallelism
¡ Sometimes obvious
¨ Assign available work to processors
¡ Balance load
¨ Minimize communication
¨ Matrix multiplication
¨ Prefix
¨ Sorting
¨ Sparse matrix multiplication
¨ N-body problems
¨ Parallel Search
For i = 1..N
  For j = 1..N
    C[i,j] = 0
    For k = 1..N
      C[i,j] += A[i,k] * B[k,j]

Each C[i,j] can be computed in parallel
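A minimal shared-memory sketch of this idea in Python (illustrative, not from the slides): each row of C is an independent task, farmed out here with multiprocessing. Shipping all of B to every worker is wasteful; the point is only to show the available parallelism.

# Sketch: each row of C is independent, so rows can be computed in parallel.
from multiprocessing import Pool

def row_times_matrix(args):
    a_row, B = args                      # one row of A, all of B
    ncols = len(B[0])
    return [sum(a_row[k] * B[k][j] for k in range(len(B)))
            for j in range(ncols)]

def parallel_matmul(A, B, nproc=4):
    with Pool(nproc) as pool:            # one task per row of C
        return pool.map(row_times_matrix, [(row, B) for row in A])

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(parallel_matmul(A, B))         # [[19, 22], [43, 50]]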


[Figure: a systolic 3x3 grid of processors (O--O--O in each row). The entries of A stream in along the rows and the entries of B down the columns, each skewed by one step so that matching pairs meet; the cell in row i, column j accumulates C[i,j], e.g. the corner cell computes C[3,3].]
¨ Implementation needs fine granularity
¨ If your computer has coarse granularity: treat each
ij as a q x q submatrix.
¡ Amount of data input / “step”: 2 x q x q
¡ Amount of computation / “step”: q^3, so the compute-to-communication ratio grows as q/2
¨ If your network has another topology, embed the grid
network into your topology.
¡ Map grid vertices to your processors
¡ Map grid edges to paths in your network
¡ Your network simulates grid
¨ Algorithm is self synchronizing. Processors wait for
data
¨ Data can be on disks in the network; how to distribute
it is an important question.
Input: A[1..N], Output: B[1..N]
B[1] = A[1]
For j = 2 to N
  B[j] = B[j-1] + A[j]   // + : any associative op

¨ Model of recurrences and carry lookahead; used for matching
work to processors; an algorithmic primitive in sorting and
N-body computation
¨ Not parallel? The jth iteration needs the result of the (j-1)th
¨ Construct C[1..N/2], C[j] = A[2j-1] + A[2j]
¨ Recursively solve. Answer D[1..N/2]
¨ D[j] = C[1] + C[2] + … + C[j]
       = A[1] + A[2] + … + A[2j] = B[2j]
¨ B[2j-1] = D[j] - A[2j]
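A minimal Python sketch of this recursion (0-indexed, illustrative names; a real parallel version would run the pairing loop and the fix-up loop concurrently across processors):

def prefix(A, op=lambda x, y: x + y):
    # Returns B with B[j] = A[0] op A[1] op ... op A[j] (0-indexed).
    # The pairing loop and the fix-up loop are each fully parallel;
    # only the recursion is sequential, giving depth O(log N).
    n = len(A)
    if n == 1:
        return [A[0]]
    # Pair up adjacent elements: C[j] = A[2j] op A[2j+1]
    C = [op(A[2*j], A[2*j + 1]) for j in range(n // 2)]
    D = prefix(C, op)                    # recurse on half-size input
    B = [None] * n
    for j in range(n // 2):              # fix-ups, also parallel
        B[2*j + 1] = D[j]                # even-position prefixes
        B[2*j] = op(D[j - 1], A[2*j]) if j > 0 else A[0]
    if n % 2:                            # odd length: last element
        B[-1] = op(D[-1], A[-1])
    return B

print(prefix([1, 2, 3, 4, 5]))           # [1, 3, 6, 10, 15]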

¨ Tree implementation: A fed at the leaves, C computed at the
parents of the leaves, B arrives back at the leaves.
¨ Fine grained algorithm
¨ For use on coarse-grained networks, embed big
subtrees on one processor.
¨ Not necessary to have complete binary trees
¨ No need for a “-” operation: compute B[2j-1] = D[j-1] + A[2j-1] instead.
¨ Algorithm can wait for data, self synchronizing.
¨ As much work and as many clever algorithms as in
sequential sorting.
¨ Merge paradigm: each processor sorts its data
locally. Then all sublists are merged. Merging can
happen in parallel.
Bucket sort paradigm:
¨ Assign one bucket to each processor.
¡ Use sampling to determine ranges.
¨ Each processor sends its keys to correct bucket.
¨ Each processor locally sorts the keys it receives.
¨ Manage communication properly: sorting is
communication intensive.
¨ Pack many keys into a single message; do not send one
message per key.
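A minimal single-machine sketch of the bucket-sort paradigm in Python (illustrative; splitter selection by sampling is simplified, and in a real implementation each bucket would be sorted by a different processor, with keys exchanged in batched messages):

import random

def sample_sort(keys, nbuckets=4, oversample=8):
    # 1. Sample keys to pick bucket boundaries ("splitters").
    sample = sorted(random.sample(keys, min(len(keys), nbuckets * oversample)))
    splitters = [sample[i * len(sample) // nbuckets] for i in range(1, nbuckets)]
    # 2. Route every key to its bucket (one bucket per processor).
    buckets = [[] for _ in range(nbuckets)]
    for k in keys:
        b = sum(k >= s for s in splitters)   # index of the range containing k
        buckets[b].append(k)
    # 3. Each "processor" sorts its bucket locally; concatenation is sorted.
    return [k for b in buckets for k in sorted(b)]

data = [random.randrange(1000) for _ in range(100)]
assert sample_sort(data) == sorted(data)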
¨ Sorting is an important primitive; pay great
attention to the communication network.
¨ Similar issues in database join operations.
¨ Key operation in many numerical codes, e.g. finite
element method.
¨ Invoked repeatedly, e.g. in solving linear systems,
differential equations.
¨ Dense matrix vector multiplication is easy -- similar
to matrix-matrix multiplication.
¨ Graph: derived from the problem, e.g. finite element
mesh graph
¨ Vector: V[j] present at vertex j of some graph
¨ Matrix: A[j,k] present on edge (j,k) of the graph
¨ Multiplication:
¡ Each vertex sends its data on all edges
¡ As the data moves along an edge, it is multiplied by the
coefficient
¡ The products are added as they arrive at the vertices.
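A minimal Python sketch of this edge-based view (illustrative names; the per-edge loop stands in for the messages, and in a real code each processor would own a subset of vertices and edges, as discussed next):

def graph_matvec(edges, V):
    # edges: dict mapping (j, k) -> A[j,k]; V: dict vertex -> value.
    # Simulates: vertex k sends V[k] along edge (j, k); the value is
    # multiplied by the edge coefficient and accumulated at vertex j.
    y = {j: 0.0 for j in V}
    for (j, k), coeff in edges.items():
        y[j] += coeff * V[k]          # message V[k] arrives at j, scaled
    return y

# 2x2 example: A = [[2, 1], [0, 3]], V = [4, 5]  ->  y = [13, 15]
edges = {(0, 0): 2.0, (0, 1): 1.0, (1, 1): 3.0}
print(graph_matvec(edges, {0: 4.0, 1: 5.0}))    # {0: 13.0, 1: 15.0}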
¨ Partition the graph among available processors s.t.
¡ Each processor gets equal no of edges (load balance)
¡ Most edges have both endpoints on the same processor
(minimize communication)
¨ Classic, hard problem
¨ Many heuristics known.
¡ Spectral Methods
¡ Multi-level methods based on graph coarsening, e.g. METIS
¨ Good partitioning is possible for “well shaped
meshes” arising out of finite element methods.
Input: positions of n stars
Output: force on each star due to all others.
¨ Naïve Algorithm: O(n^2); for each star, calculate the
force due to every other (sketched after the next bullet).
¨ Fast Multipole Method: O(n). Based on clustering
stars and considering cluster-cluster interactions +
star-cluster + star-star interactions.
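A minimal Python sketch of the naive method (illustrative: unit masses, 2-D positions, and a small softening term eps added to avoid division by zero). Each star's force sum is independent of the others, so the outer loop parallelizes trivially.

import math

def naive_forces(pos, G=1.0, eps=1e-9):
    # O(n^2): for each star, sum the force due to every other star.
    # pos: list of (x, y) tuples; unit masses; returns (fx, fy) per star.
    n = len(pos)
    forces = []
    for i in range(n):
        fx = fy = 0.0
        for j in range(n):
            if i == j:
                continue
            dx, dy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
            r = math.sqrt(dx*dx + dy*dy) + eps
            f = G / (r * r)               # force magnitude, unit masses
            fx += f * dx / r              # direction: towards star j
            fy += f * dy / r
        forces.append((fx, fy))
    return forces

print(naive_forces([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]))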
¨ Star data resides at leaves of an octree.
¨ Cluster data resides at internal nodes
¡ Computed by passing data along tree edges.
¨ Cluster-cluster interaction: data flows between
neighbours at same level of tree.
¨ Dataflow is “pyramidal”.
¡ Possible to nicely embed pyramids in most networks.
¡ More complexity because the pyramid might have a variable
number of levels if stars are distributed unevenly.
¨ “Find a combinatorial object satisfying properties
…”, e.g. packing, TSP, …
¨ Naturally parallel, but unstructured growth of the
search tree.
¨ Difficulty in load balancing.
¨ Key questions: can we maintain a distributed work queue?
How do we share the best solution found so far, and the bounds?
¨ Prefix like computations useful in maintaining
distributed queue.
¨ Randomness also useful. Each processor asks for
work from a random processor when its work
queue runs out (sketched below).
¨ Broadcasting required to maintain the best solution and
bounds.
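A minimal shared-memory sketch of this randomized strategy using Python threads (illustrative; a distributed version would steal via messages, and termination detection is simplified here to a global counter of completed tasks):

import queue
import random
import threading

NPROC, NTASKS = 4, 100
queues = [queue.Queue() for _ in range(NPROC)]
done = threading.Event()
completed = [0]
lock = threading.Lock()

def worker(me):
    while not done.is_set():
        try:                                   # try own queue first
            task = queues[me].get_nowait()
        except queue.Empty:                    # out of work: steal from
            victim = random.randrange(NPROC)   # a random processor
            try:
                task = queues[victim].get_nowait()
            except queue.Empty:
                continue                       # busy-wait; fine for a sketch
        task()                                 # execute the unit of work
        with lock:
            completed[0] += 1
            if completed[0] == NTASKS:
                done.set()

# Unbalanced initial load: all tasks start on processor 0.
for i in range(NTASKS):
    queues[0].put(lambda i=i: sum(range(1000)))

threads = [threading.Thread(target=worker, args=(p,)) for p in range(NPROC)]
for t in threads: t.start()
for t in threads: t.join()
print("all", NTASKS, "tasks completed")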
¨ Graph embedding
¡ Develop algorithm for one network, embed that network
on what you actually have
¡ Form graph of dataflow / data dependence, embed that in
the network you have
¨ Solve problem locally, merge global results
¡ Sorting
¡ Matrix multiplication: locally implement subblock
multiplication
¡ Useful in graph algorithms
¨ Randomization
¡ Useful in load balancing
¡ Also used in communication, select paths randomly
¡ Symmetry breaking
¡ Sampling
¨ Co-ordination
¡ Prefix is useful
¡ Distributed data structures, e.g. queues, hash tables
¨ Communication Patterns
¡ All-to-all: e.g. sorting
¡ Permutation
¡ Broadcasting
¨ Vast subject
¨ Many clever algorithms
¨ Many models of computation
¨ Many issues besides algorithms: how to express
parallelism in a programming language, …
¨ Quick tour of some of the important ideas.
