
Contents

Topic: Introduction to Parallel Algorithms Design and Techniques
Objective
1. Introduction
2. Designing Parallel Algorithms
3. Partitioning or Decomposition
4. Communication and Synchronisation
5. Basic Parallel Programming Techniques
6. Data Dependencies
7. Load Balancing
8. Granularity
9. Speed Up Factor
10. Case Studies
Summary
Exercises

TOPIC: INTRODUCTION TO PARALLEL ALGORITHMS DESIGN AND TECHNIQUES


OBJECTIVE
This chapter gives an overview of how to approach a problem and design an algorithm for it in a parallel manner. The chapter starts with an insight into how a problem can be decomposed so that it can be solved in parallel, touching upon aspects of data dependencies and the measures of parallel algorithms. The last section deals with specific parallel algorithms.

1. INTRODUCTION
In general, sequential algorithms will not perform well on parallel machines, because they are optimized for a single CPU. To develop parallel algorithms, we therefore need a different approach. When we design an algorithm to be solved on a parallel machine, it is known as a parallel algorithm. When designing a parallel algorithm, the primary aspect is to understand the inherent parallelism in a problem, i.e., how a given task can be decomposed into subtasks and whether we can achieve data parallelism or task parallelism.

To understand this, let us take a simple example of summing n numbers in an array. On a sequential machine, the code we generally write is as follows:

sum = 0;
for (i = 0; i < n; i++)
{
    sum += a[i];
}

Using OpenMP for shared memory, we saw that we could convert the above program into a parallel one by adding some pragmas, as shown below:

sum = 0;
#pragma omp parallel for shared(sum)
for (int i = 0; i < n; i++)
{
    #pragma omp critical
    sum += a[i];
}

By adding these compiler pragmas we converted the existing serial code into parallel code: the compiler distributes the iteration chunks of the for loop to different threads, and synchronization of the global shared variable sum is achieved by the critical pragma, which tells the compiler that while sum is being read and updated by one thread, no other thread may access it.

The above method is only a way to convert existing serial programs into parallel ones, and though we have achieved some level of parallelism, it is not the best way, because time is lost in synchronization when updating sum globally; the original program was simply not written with parallelism in mind. To write the same program thinking in parallel, one way could be to assign to each processor some part of the for loop; each processor computes the sum of its own section and only then adds its local sum to the global sum. While computing the local sums there is no need for synchronization; the need for synchronization arises only when updating the global sum. This gives a better program than the one above.

// Declare sum as a shared variable
sum = 0;
#pragma omp parallel for shared(sum)
for (i = 0; i < nproc; i++)          // here, nproc is the number of processors
{
    tempsum[i] = 0;
    for (j = 0; j < n/nproc; j++)
        tempsum[i] += a[i*(n/nproc) + j];
    #pragma omp critical
    sum += tempsum[i];               // leaving the critical section releases the shared variable sum
}

Here, the array is divided into nproc segments which are summed by different processors; their individual sums are then added into the global variable sum, which is locked prior to access so that the processor concerned has mutually exclusive access to it and, after updating it, releases the lock. OpenMP in fact offers an even more compact way to express this pattern, sketched below. With this simple example, we proceed ahead to understand parallel algorithms.
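As a side note, here is a minimal sketch (not part of the original example) of the same summation using OpenMP's reduction clause, which gives every thread a private partial sum and combines the partial sums automatically at the end of the loop, so no explicit critical section is needed:

sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < n; i++)
    sum += a[i];          // each thread accumulates into its own private copy of sum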

2. DESIGNING PARALLEL ALGORITHMS


We have already seen that tools such as OpenMP allow a serial code to be parallelized by adding compiler directives, and to a large extent this is effective. However, to design optimal parallel algorithms, the programmer has to identify and implement the parallelism manually, which is evident from the example in the previous section. The first step in designing a parallel algorithm is to identify whether a given problem is parallelizable or not. For example, the problem of matrix multiplication is definitely parallelizable, because the matrices can be divided into sub-matrices that can be processed simultaneously by multiple processors. However, the computation of the Fibonacci series is not parallelizable, because to calculate the next Fibonacci number the previous two Fibonacci numbers are needed; hence there is a dependency in the calculations.

The next step is to identify the activities that can be done in parallel. While looking at a program, we also need to see which sections of the code consume most of the time (these are termed hot spots) and focus on parallelizing those sections first. Once the tasks are identified, the algorithm designer has to consider whether these tasks are independent, whether there are any data dependencies, whether the tasks need to communicate with each other, and whether any synchronization is required between the decomposed tasks. The remaining sections of this chapter discuss these steps in detail.

3. PARTITIONING OR DECOMPOSITION
When designing a parallel program we need to break the problem into discrete "chunks" of activities that can be distributed to different processors or threads. This is known as partitioning or decomposition. While decomposing a problem, the focus is only on analyzing areas for parallel execution; it is not on how many processors or threads the system actually has. There are mainly two kinds of decomposition:

1. Task decomposition or functional decomposition
2. Data decomposition or domain decomposition

Task Decomposition or Functional Decomposition
In this approach the focus is on the computation that is to be performed; in other words, different activities are identified as tasks and each task performs a portion of the overall work. Functional decomposition is applied wherever the problem can be split easily into different tasks.

Data Decomposition or Domain Decomposition
Here, the data associated with a problem is decomposed and each of the parallel tasks works on a portion of the data set. For example, if we need to add two arrays of 100 elements each, we could divide this work among, say, N processors such that each processor performs the addition of 100/N elements of the arrays, as in the sketch below.
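A minimal sketch of this kind of domain decomposition in OpenMP (the array names x, y, z and the size n are illustrative assumptions, not taken from the text); the runtime splits the index range among the available threads, so each thread adds its own portion of the arrays:

// z[i] = x[i] + y[i], with the iteration space divided among the threads
#pragma omp parallel for
for (i = 0; i < n; i++)
    z[i] = x[i] + y[i];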

Apart from the above methods of decomposition, occasionally there can also be a decomposition based on data flow, where the critical issue is how the data flows between different tasks. The classical example is the producer-consumer problem, where the output of one task, the producer, becomes the input of another task, the consumer.

4. COMMUNICATION AND SYNCHRONISATION


When we parallelize a problem, we can see that some types of problems can be decomposed and executed in parallel with no need for the tasks to share any data. For example, adding two arrays and storing the result in a third array requires no communication between the tasks. However, most parallel algorithms do require the tasks to communicate with each other. For example, when calculating the sum of the elements of an array, if each task computes a local sum on its chunk of data, it then needs to update the global sum. Along with communication come the issues of synchronization: how will these tasks access the global variable, and how do we make access to these shared variables mutually exclusive? There are various methods of synchronization; the two main ones are locks (or semaphores) and barriers, and a short OpenMP sketch of both follows their descriptions below.

Locks or Semaphores: Locks or semaphores are used to serialize or protect access to shared data or a section of code. At any time only one task can hold the lock or semaphore. Once a task acquires the lock, it can access the shared data or code, and the other tasks wait for it to release the lock. After the lock is released, the next waiting task gets access.

Barriers: Often, at some point of the program, the tasks need to be synchronized. For example, a serial piece of code may need to be executed only after all tasks have completed their work. Barriers are placed at these points in the program, where we want all the tasks to have completed their work before any of them proceeds. At a barrier each task stops, or blocks; when the last task reaches the barrier, all tasks are synchronized and may continue.
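A minimal OpenMP sketch of both mechanisms (the helpers do_local_work() and next_phase() and the variable global_total are illustrative assumptions): the critical directive plays the role of a lock around the shared update, and the barrier directive makes every thread wait until all of them have finished the first phase:

double global_total = 0.0;   // shared across all threads

#pragma omp parallel
{
    double local = do_local_work();   // each thread computes its own result

    #pragma omp critical              // lock-like: one thread at a time updates the shared total
    global_total += local;

    #pragma omp barrier               // all threads wait here until the last one arrives
    next_phase();                     // global_total is complete and visible from this point on
}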

5. BASIC PARALLEL PROGRAMMING TECHNIQUES


Most of the time the parallelization effort is spent on the loops that carry out the bulk of the iterative work. The two most common techniques used are known as loop splitting and self-scheduling.

Loop Splitting
The loop is split in such a way that if there are n iterations in the loop and there are p processors, then each processor carries out n/p iterations of the task.

If the loop is being split by the programmer, then the most common method would be as follows:

// fork nproc processes or create nproc threads, each with a different
// process or thread id (proc_id).
// Each process or thread will then execute the following for loop:

for (i = proc_id; i < n; i = i + nproc)
    Body of the loop;

On most systems the process or thread ids are the integers 0 to p-1 for p processes or threads. So, one way the loop can be split up is that the first iteration of the loop is carried out by the first process or thread, the second iteration by the second process or thread, and so on, in round-robin fashion. If there are, say, three processors or threads, then the first processor executes the loop for index values 0, 3, 6, ..., the second for 1, 4, 7, ..., and the third for 2, 5, 8, ..., until all the iterations are complete.

Another method of splitting the loop is to give a contiguous chunk of about n/p iterations to each of the processors. The code below uses an OpenMP compiler pragma which, with the default static schedule, typically divides the work into blocks of roughly n/p iterations per processor:

#pragma omp parallel for
for (i = 0; i < n; i++)
    Body of the loop;

Self-Scheduling
Sometimes the work inside the loop depends on the loop variable, and hence is small for some index values and a long set of instructions for others; in that case loop splitting will not achieve optimum performance. In such a scenario we need the processes or threads to self-schedule: as soon as a process or thread completes its work, it comes back and asks for the next piece of work (in this case, the next loop index). Here, the loop index is a shared variable accessed with the lock and unlock method of synchronization. Each process or thread that gets access to the loop index increments it for the next execution and proceeds to execute the body of the loop for the index it read. The pseudo code below indicates self-scheduling.

master thread:
    create worker threads()

worker threads:
    while (1)
    {
        if (GetNextPendingWork(&i))   // self-scheduling: fetch the next pending index
            Do_Work(i);
        else
            break;
    }

Note: When using OpenMP, we can also use the schedule(dynamic, chunk) clause, where iterations are handed out to threads in chunks; once a particular thread finishes its allocated chunk, it returns to get another one from the iterations that are left. We also have control over the chunk of iterations allocated at a time, as sketched below.
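A minimal sketch of that clause (the chunk size of 4 and the helper body_of_loop() are illustrative assumptions): each thread repeatedly grabs the next chunk of 4 iterations as soon as it finishes its previous chunk, which approximates the self-scheduling idea above:

#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < n; i++)
    body_of_loop(i);   // the work per iteration may vary with i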

6. DATA DEPENDENCIES
In the previous section, when executing the body of the loop, we assumed that the data handled in the body of the loop are independent and do not depend on which thread processes them first. But this is not the case in real-world programming: the data will have dependencies. A dependence exists between program segments when the order of execution of those segments affects the result of the program. Understanding dependencies is important for parallel programming because they are one of the primary inhibitors of parallelism: a set of statements can be executed in parallel only if there are no dependences between them.

Consider a pair of statements, STMT 1 and STMT 2. There are three main types of dependencies between such statements:

1. True dependence or flow dependence
This dependence occurs if the value computed by STMT 1 is used in STMT 2, i.e., when a write occurs before a read. Example:
STMT 1: Z = X + Y;
STMT 2: A = Z + 1;
Here, the second statement needs the value of Z computed by the first statement, and hence the result varies with the order of execution.

2. Anti dependence
This is the opposite of flow dependence. It occurs if STMT 2 modifies a variable that STMT 1 reads, assuming STMT 1 comes before STMT 2 in execution; here a read occurs before a write. Example:
STMT 1: A = Z;
STMT 2: Z = X + Y;
The variable Z is read first and then written by the second statement, so the order of execution will produce different results.

3. Output dependence
This occurs if both STMT 1 and STMT 2 modify the same variable and STMT 1 precedes STMT 2 in execution; here a write occurs after a write. Example:

STMT 1: Z = X;
STMT 2: Z = Y;
The variable Z is written by both statements, and hence the order of execution determines the final value of Z.

Overcoming Data Dependencies
Whenever we design parallel programs, loop-carried data dependencies have to be examined. Loop dependencies are of many types: forward dependency, backward dependency, breaking out of a loop, and so on.

Sometimes in a loop we need to handle a forward dependency. Take the case of a code with a forward dependency:

for (i = 0; i < n-1; i++)
    a[i] = a[i+1];

Here the loop is not directly parallelizable because of the anti-dependency: the source and target arrays are the same, so a[i+1] must be read before a later iteration overwrites it. To solve this, we can reorganize the computation using a second array:

for (i = 0; i < n; i++)
    b[i] = a[i];
for (i = 0; i < n-1; i++)
    a[i] = b[i+1];

First transfer array a to b, then use array b to copy the (i+1)th element into a. Both loops are now parallelizable, but one loop has been split into two and the program needs extra space (array b), which makes it less efficient.

A better way of solving the above problem is block scheduling. In block scheduling, each processor executes the loop for a sequential set, or block, of loop indexes, so the dependency problem occurs only at the border, i.e., at the end of each block. If we save the first element of each following block in a small array, we need to store only p - 1 elements, where p is the number of processes or threads. The program can now be written as:

// store the first element of every block except the first (the border elements)
for (i = 0; i < p-1; i++)
    b[i] = a[(i+1)*n/p];

barrier;   // all processes/threads must wait until the border elements are saved

for (i = 0; i < p; i++)   // execute this loop in parallel, one block per process/thread
{
    idx = i*n/p;
    for (j = 0; j < (n/p)-1; j++)
        a[j+idx] = a[j+1+idx];
    if (i < p-1)
        a[(i+1)*n/p - 1] = b[i];   // the last element of the block comes from the saved border
}

Let us now look at a problem of backward dependency. For example,

for (i = 1; i <= n; i++)
    a[i] = a[i-1] + b[i];

In this example, the value of a[i-1] must be computed before a[i]; the loop therefore has a data dependency that inhibits parallelism, because a task can compute a[i] only after the previous task has computed a[i-1]. If we look at the code carefully,

a[1] = a[0] + b[1]
a[2] = a[1] + b[2] = a[0] + b[1] + b[2]
a[3] = a[2] + b[3] = a[0] + b[1] + b[2] + b[3]

and in general a[n] = a[n-1] + b[n] = a[0] + b[1] + b[2] + ... + b[n].

This reformulation has eliminated the dependency, but we can see that for each element a[i] we now need the cumulative sum of the array elements b[1] to b[i], i.e., a prefix sum of b (a parallel sketch of which is given below).
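A minimal sketch of how that prefix sum could itself be computed in parallel (this is not from the text; the function name, the helper array block_off, and the blockwise two-pass scheme are illustrative assumptions). Each thread first totals its own block of b, one thread then accumulates the block totals, and finally each thread fills in its block of a starting from its block's offset:

#include <stdlib.h>
#include <omp.h>

void prefix_recurrence(double *a, const double *b, long n)
{
    int p = omp_get_max_threads();
    double *block_off = calloc(p + 1, sizeof *block_off);
    block_off[0] = a[0];                      // target: a[i] = a[0] + b[1] + ... + b[i]

    #pragma omp parallel num_threads(p)
    {
        int t = omp_get_thread_num();
        long lo = 1 + n * t / p;              // this thread's block of indexes [lo, hi)
        long hi = 1 + n * (t + 1) / p;

        double local = 0.0;                   // pass 1: total of b over this block
        for (long i = lo; i < hi; i++)
            local += b[i];
        block_off[t + 1] = local;

        #pragma omp barrier
        #pragma omp single                    // scan of the block totals (cheap: only p values)
        for (int k = 1; k <= p; k++)
            block_off[k] += block_off[k - 1];
        // implicit barrier at the end of the single construct

        double run = block_off[t];            // pass 2: fill in a[] for this block
        for (long i = lo; i < hi; i++) {
            run += b[i];
            a[i] = run;
        }
    }
    free(block_off);
}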

Many programs exit a loop with a break statement when a certain condition is met; this is an example of control dependence. In such a scenario there will be problems if we assign loop iterations, for index values beyond the one at which the loop would break, to multiple processors. So, in loops that break out early, parallelization has to be done by studying the loop and identifying at what value of the index variable the break will take place.

7. LOAD BALANCING
Load balancing refers to the way in which the work is distributed among all the processes or threads so that all of them are kept busy all of the time. By balancing the load we minimize the idle time of the processors and improve performance. For example, suppose we have created p threads and assigned work to them, and to synchronize these threads we have added a barrier; then the slowest thread determines the overall performance.

8. GRANULARITY

The number and size of the tasks into which a problem is decomposed determine its granularity. If the problem is decomposed into a small number of large tasks, the decomposition is said to be coarse grained, whereas a decomposition into a large number of small tasks is fine grained. The parallel time can often be reduced by making the decomposition finer in granularity. However, if the granularity is too fine, it is possible that the overhead required for communication and synchronization between tasks takes longer than the computation itself. So, while designing a parallel algorithm, the level of granularity is chosen based on the algorithm itself and the hardware environment in which it runs; the hypothetical figures below illustrate the trade-off.
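As a rough, purely hypothetical illustration of this trade-off (the numbers are assumptions, not measurements from the text): suppose each fine-grained task performs 10 microseconds of useful computation but also incurs 50 microseconds of communication and synchronization overhead; then only 10/(10 + 50), about 17%, of the time is spent computing. If ten such tasks are merged into one coarser task, the useful work becomes 100 microseconds against the same 50 microseconds of overhead, i.e., 100/150, about 67%, of the time is spent computing.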

9. SPEED UP FACTOR
One of the important measures of a parallel program is the speed-up factor, which gives the increase in speed obtained by using multiple processors. To compute this factor, we take as the baseline the best sequential algorithm on a single-processor system; the algorithm used for the parallel implementation might be a different one.

Let Ts(n) be the time required by the sequential algorithm to solve a problem of size n.
Let Tp(n) be the time required by the parallel algorithm to solve a problem of size n.

Then the speed-up of the parallel algorithm is given by:
S = Ts(n)/Tp(n)

If the number of processors is p, then the efficiency is given by:
E = S/p

A small numerical illustration follows.
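For example (the timings here are hypothetical, chosen only to illustrate the formulas): if the best sequential algorithm takes Ts(n) = 80 seconds and the parallel algorithm takes Tp(n) = 25 seconds on p = 4 processors, then S = 80/25 = 3.2 and E = 3.2/4 = 0.8, i.e., the processors are doing useful work about 80% of the time on average.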

10. CASE STUDIES

1. Matrix Multiplication
To multiply two matrices, the code section in serial mode is as follows:
// read matrices A[M][N] and B[N][O]
for (i = 0; i < M; i++)
{
    for (j = 0; j < O; j++)
    {
        C[i][j] = 0;
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k]*B[k][j];
    }
}

To convert this into a parallel program, we see that there are three for loops which could be parallelized. The best option is to parallelize the outermost loop. The code section in OpenMP for parallelizing the matrix multiplication could be as follows:

#pragma omp parallel shared(A, B, C, chunk) private(i, j, k)
{
    // chunk is a block size chosen elsewhere in the program
    #pragma omp for schedule(static, chunk)
    for (i = 0; i < M; i++)
    {
        for (j = 0; j < O; j++)
        {
            C[i][j] = 0;
            for (k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
    }
} /*** End of parallel region ***/

The above code indicates that the outer loop is split and scheduled statically, with chunks of the loop given to the various threads available in the system. Each thread will therefore calculate a set of rows of the C matrix. The division of work is coarse grained, as an entire set of rows of the C matrix is given to each thread; if we were to parallelize the inner loops instead, we would obtain fine-grained parallelism.

2. Sorting Algorithms
Here we will discuss two sorting algorithms: the odd-even transposition sort (a variant of bubble sort) and the quick-sort algorithm.

a) Bubble Sort
The bubble sort technique is difficult to parallelize; however, the odd-even transposition (serial) sort can be considered for parallelization.

for (i = 0; i < n; i++)
{
    if ((i % 2) == 0)          // even phase: compare pairs (a[0],a[1]), (a[2],a[3]), ...
    {
        for (j = 0; j < n/2; j++)
            comp_exchg(a[2*j], a[2*j+1]);
    }
    else                       // odd phase: compare pairs (a[1],a[2]), (a[3],a[4]), ...
    {
        for (j = 0; j < n/2 - 1; j++)
            comp_exchg(a[2*j+1], a[2*j+2]);
    }
}

The function comp_exchg() compares the two elements it takes as arguments and ensures that the smaller element comes before the larger one. After n phases of odd-even exchanges the sequence is sorted. Each phase of the algorithm (odd or even) requires Θ(n) comparisons, so the serial complexity is Θ(n²). We observe that the comparisons within each phase (odd or even) are independent of one another, and hence we can trivially parallelize this algorithm. So, if we have processors P1 to Pn to sort n numbers,

and each of these processors can communicate with its neighbors, then we can parallelize as shown in the code section below:
// thread_id is the id of the particular thread or process
for (i = 0; i < n; i++)
{
    if ((i % 2) == 0)                       // even phase
    {
        if ((thread_id % 2) == 0)
            comp_exchg_min(thread_id + 1);  // compare-exchange with the right neighbor, keeping the smaller value
        else
            comp_exchg_max(thread_id - 1);  // compare-exchange with the left neighbor, keeping the larger value
    }
    else                                    // odd phase
    {
        if ((thread_id % 2) != 0)
            comp_exchg_min(thread_id + 1);
        else
            comp_exchg_max(thread_id - 1);
    }
}

Since there are n processors and n iterations, and in each iteration each processor does one compare-exchange operation, the parallel run time of this formulation is Θ(n). This is cost optimal with respect to the Θ(n²) serial odd-even transposition sort.

b) Quick Sort
The quick-sort algorithm works by selecting a pivot element and moving all elements smaller than the pivot to one side of the array and the remaining elements to the other side. If either of the two subsequences contains more than one element, the algorithm recursively calls itself on that subsequence. The quick-sort algorithm can be broadly written as:

Divide: pick a random element e as the pivot element and partition the array to be sorted into
    Partition 1: elements less than the pivot element e
    Partition 2: elements greater than the pivot element e
Recursion: recursively divide Partition 1 and Partition 2 if a partition has more than one element.
Conquer: join Partition 1, the pivot, and Partition 2.

Since quick-sort has a divide-and-conquer nature, it can be parallelized fairly easily. The average running time of sequential quick-sort is O(n log n) and its worst-case running time is O(n²); the number of worst cases encountered can be reduced by choosing the pivot elements with some intelligence. However, because the top-level divide in quick-sort involves all the data and must be done sequentially, a straightforward parallelization will not provide optimal performance.

One way of parallelizing is to start a thread or a process for each recursive call. Individual partition operations are difficult to parallelize, but after a partition is produced, the different sections of the list can be sorted in parallel. If we have n elements and p processors, we can divide a list of n elements into p sub-lists in Θ(n) average time and then sort each of these in Θ((n/p) log(n/p)) average time. Given below is a section of quick-sort code in OpenMP, where the recursive calls are executed in parallel using the sections pragma.
void QuickSort(int numList[], int llimit, int ulimit)
{
    if (llimit < ulimit)
    {
        int pivotindx = make_partition(numList, llimit, ulimit);

        // The recursive calls are executed in parallel
        #pragma omp parallel sections
        {
            #pragma omp section
            QuickSort(numList, llimit, pivotindx - 1);

            #pragma omp section
            QuickSort(numList, pivotindx + 1, ulimit);
        }
    }
}

The problem with this parallel quick-sort is that the partitioning step must be completed before the sub-lists can be processed in parallel, which limits the available parallelism. A further improvement is to perform the partition step itself in parallel.

SUMMARY
In general, sequential algorithms will not perform well on parallel machines. This chapter has dealt with the design of parallel algorithms. When designing a parallel algorithm, the primary aspect is to understand the inherent parallelism in a problem, i.e., how a given task can be decomposed into subtasks and whether we can achieve data parallelism or task parallelism. The chapter has shown how a problem can be decomposed so that it can be solved in parallel. The various kinds of data dependency and the means of overcoming them have been discussed, and the issues of load balancing, granularity of tasks, and the speed-up factor have been addressed. The last section dealt with the design of specific parallel algorithms for matrix multiplication and for sorting, namely odd-even transposition sort and quick sort.

EXERCISES
1. Using OpenMP parallel programming, generate all the prime numbers from 1 to n.
2. Sort a given set of elements using the merge sort method and determine the time required to sort them. Repeat the experiment for different values of n, the number of elements in the list to be sorted, and plot a graph of the time taken versus n. Parallelize the program using OpenMP and plot a graph to compare with the sequential code.
