
Parallel Heuristic Graph Matching Algorithm for Task Assignment Problem in Distributed Computing Systems
R Mohan^1, N P Gopalan^2, Dinesh Prasanth S H^3, Siddhant Sanyam^4, Saagar R Varma^5

^1 Assistant Professor, Dept. of Computer Science and Engineering, National Institute of Technology Tiruchirapalli, Tamil Nadu 620015, India. rmohan@nitt.edu

^2 Professor, Dept. of Computer Applications, National Institute of Technology Tiruchirapalli, Tamil Nadu 620015, India. npgopalan@nitt.edu

^3,4,5 Dept. of Computer Science and Engineering, National Institute of Technology Tiruchirapalli, Tamil Nadu 620015, India. mail2dineshnow@gmail.com, siddhant3s@gmail.com, saagarvarma18@gmail.com

Abstract: Task assignment is one of the most challenging problems in a distributed computing environment. An optimal task assignment guarantees minimum turnaround time for a given architecture. Several approaches to optimal task assignment have been proposed, ranging from graph-partitioning based tools to heuristic graph matching. With heuristic graph matching, it is often impossible to obtain an optimal task assignment for practical test cases within an acceptable time limit. Some researchers have addressed this with a divide-and-conquer strategy and have successfully applied it to find optimal task assignments on the processors constituting a node of a cluster of multi-processors, giving acceptable assignments within acceptable time limits. In this paper, the basic heuristic graph-matching algorithm of task assignment put forward by previous research is parallelized. The processors over which the task assignment is carried out are made asymmetric by assigning each task a different cost to execute on each processor to which it may be assigned. Results show that near-optimal assignments (>90%) are obtained much faster than with sequential task assignment in all cases.

Index Terms: Parallel task assignment, load balancing, heterogeneous processors, asymmetric multiprocessors, heuristic graph matching.

I. INTRODUCTION

Load balancing is the process by which the task modules constituting an application are assigned to processors, with the goals of maximizing processor utilization and minimizing the total turnaround time. It can be viewed as a task assignment problem in which task modules are assigned to the available asymmetric multi-processors so as to achieve these two goals. This can be achieved by a number of techniques, viz. graph partitioning and graph matching (the two most widely used), hybrid methodologies, and mathematical programming. In this paper, a parallel graph-matching algorithm is presented.

A. Graph Partitioning based Methodologies

Graph partitioning techniques view the task as a task graph whose vertices represent the task modules and whose edges represent the communication between those tasks. Load balancing then relies on graph partitioning methodologies that, with the help of inter-node communication, produce equal partitions of the task graph. The number of partitions depends on the number of processing elements and their topology in the processor graph. Factors involved in selecting the number of partitions are:

Load balance: The computational work of each processor should be balanced, so that no processor waits for others to complete. Assuming the computational work per processor is proportional to the number of task modules in its partition, load balance requires the number of task modules in each partition to be the same.

Communication cost: On a parallel computer, accumulating the contributions from nodes that are not on the current processor incurs communication cost. On distributed-memory parallel computers, the cost of accessing remote memory is far higher than that of accessing local memory (typically by a ratio between 10 and 1000). It is therefore important to minimize the communication cost.

Steady advances have renewed interest in the graph bisection problem, which has been studied in the past by many authors (see, e.g., [4], [5], [6]) in the context of graph theory as well as VLSI circuit layout. The problem is NP-hard, so no known algorithm finds the exact solution in polynomial time. Most graph bisection methods therefore seek a good approximation to the optimal partitioning that can be calculated efficiently. Proposed algorithms include the Recursive Graph Bisection (RGB) algorithm [6] and the Kernighan-Lin (K-L) algorithm ([4], [8], [9]). Although the multilevel approach of Kernighan-Lin reduces the computing time significantly, for a very large task graph it can prove memory intensive, often exceeding the limits of single-CPU memories. Furthermore, as the ultimate purpose of partitioning is the subsequent implementation of the application code on parallel machines, it makes sense to have a parallel partitioning code. Besides, a fast parallel partitioning algorithm can also be used for dynamic load balancing. There have been a number of efforts in this area: techniques such as matching parallel processes to a hypercube and spectral bisection have been pursued in [10], [11], [12]. Our research, however, focuses primarily on the problem of task assignment using parallel graph matching.

B. Heuristic Graph Matching

This is a graph-matching based method that uses a task graph and a processor graph. While the task graph denotes the dependency among the task modules, the processor graph defines the topology of interconnection among the processors. A classical example is the work by Shen and Tsai [3], which uses the well-known A* algorithm to find the optimal task assignment.

C. Motivation for the work done

Parallel graph partitioning algorithms exist and have been used in numerical test cases all across the spectrum. This led us to the observation that, even though parallel graph partitioning algorithms exist, there is not much research on parallel graph matching algorithms. Graph matching algorithms like the one proposed by Shen and Tsai [3] become far more useful when they can be made to perform more efficiently

by parallelizing them. A parallel graph-matching methodology is likely to reduce the size of the state-space handled by each parallel graph-matching task, thus producing an acceptable mapping within acceptable time limits. Hence, the heuristic graph-matching algorithm of task assignment put forward by Shen and Tsai is parallelized here for asymmetric multi-processors with identical inter-processor communication links. The methodology is tested on a few representative test cases with the Message Passing Interface (MPI). The results show that near-optimal assignments are obtained with higher efficiency than the sequential program in all cases.

II. PARALLEL GRAPH MATCHING ALGORITHM

In this section, the original sequential algorithm proposed by Shen et al. in [3] is first presented. The algorithm is then analyzed and the portions that can be parallelized are identified. Finally, the parallel graph-matching algorithm is presented and explained with an illustrative example.

A. Heuristic Graph Matching: the sequential algorithm

As mentioned in the previous section, the mapping assigns any one or more of the n task modules to any one or more of the m processors, with no task module assigned to more than one processor. The method starts from an initial mapping (any one task module assigned to any one processor) and expands the state-space by generating other permissible mappings. Each mapping, or state-space entry, has an associated cost function denoted by f. In [3], this cost function is expressed in terms of a single entity, time, and may be thought of as composed of two parts: g, which may be viewed as the cost of generating the state-space entry, and h, the heuristic weight associated with the entry, which may be viewed as the cost of generating the goal state from the present state-space entry. Thus, for each mapping or state-space entry,

f = g + h    (1)

1) Computation of g: If there is a task with n modules 1, 2, 3, ..., n and m processors p1, p2, p3, ..., pm in a distributed computing system, an n by m matrix X can be used to represent the mapping M from task modules to processors, where

X_ij = 1 if module i is assigned to processor j, and 0 otherwise.    (2)

Then the total execution time of all the modules assigned to processor pj, denoted by PT_j, is defined by the following formula, where m_i is the i-th module and Cp(m_i, p_j) denotes the computation time of module m_i on processor p_j:

PT_j = Σ_{i=1}^{n} Cp(m_i, p_j) · X_ij    (3)

An m by m matrix L can be used to represent the interconnection among the processors:

L_ij = 1 if processors i and j are connected, and 0 otherwise.    (4)

Another n by n matrix R represents the communication relationship between each pair of modules:

R_ij = 1 if modules i and j communicate, and 0 otherwise.    (5)

With the definitions of the matrices L and R, the following formula gives the total communication time spent on processor j, where Cc((m_s, m_t), (p_j, p_k)) denotes the communication time between modules m_s and m_t over the link between processors p_j and p_k:

CT_j = Σ_{k=1}^{m} Σ_{s=1}^{n} Σ_{t=1}^{n} Cc((m_s, m_t), (p_j, p_k)) · X_sj · X_tk · R_st · L_jk

As long as there is an upper bound hmax for h, i.e. h ≤ hmax, the A* algorithm guarantees that an optimal solution is found. Thus, with g = 0 the search becomes a purely heuristic search, and with h = 0 it becomes a best-first search.

With the execution time and the communication time calculated, the turnaround time of processor j is

TA_j = PT_j + CT_j    (6)

Since the turnaround time of a distributed computing system is the maximum of the turnaround times of all the processors, the following formula defines the cost g of the mapping:

g = COST(mapping) = max_{1≤j≤m}(TA_j)    (7)
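The cost g of a (partial) mapping can be evaluated directly from equations (3)-(7). The sketch below is a minimal illustration, not the paper's implementation: it uses the 2-processor, 6-module test case of Section III and assumes the two processors are connected by identical links, so the Cc terms reduce to the entries of the inter-module communication matrix C; the function name cost_g is ours.

```python
# Sketch: evaluating g = COST(mapping) = max_j TA_j (equations (3)-(7))
# for the 6-module, 2-processor test case of Section III.
n, m = 6, 2
# TP[j][i]: computation time of module i on processor j
TP = [[10, 15, 5, 20, 15, 10],
      [12, 13, 7, 18, 13, 13]]
# C[s][t]: communication time between modules s and t
C = [[0, .5, 0, .5, 0, 0],
     [.5, 0, .8, 0, .2, 0],
     [0, .8, 0, .1, 0, .2],
     [.5, 0, .1, 0, 0, 0],
     [0, .2, 0, 0, 0, .3],
     [0, 0, .2, 0, .3, 0]]

def cost_g(assign):
    """g for a (partial) mapping given as a dict {module: processor}."""
    PT = [0.0] * m                 # execution time per processor, eq. (3)
    CT = [0.0] * m                 # communication time per processor
    for i, j in assign.items():
        PT[j] += TP[j][i]
        for s, k in assign.items():
            if k != j:             # only inter-processor pairs contribute
                CT[j] += C[i][s]
    TA = [PT[j] + CT[j] for j in range(m)]   # eq. (6)
    return max(TA)                           # eq. (7)

# Complete mapping 0B 1B 2A 3A 4B 5A (A = processor 0, B = processor 1)
print(round(cost_g({0: 1, 1: 1, 2: 0, 3: 0, 4: 1, 5: 0}), 2))
```

Note that each cross-processor edge contributes to the CT of both of its endpoint processors, matching the fact that the Cc terms appear in the CT_j sum of both processors j and k.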

2) Computation of h:

Step 1: For a given mapping M, let k be the processor with maximum turnaround time, i.e. the processor satisfying

COST(M) = TA_k = max_{1≤j≤m}(TA_j)    (8)

Step 2: Let MK be the set of modules assigned to processor k by the partial mapping M, and let M̄K be the set of modules not yet assigned by M. Let B ⊆ M̄K denote the set of unassigned modules that communicate with one or more modules on processor k, and let LK be the set of edges between modules assigned to k and modules not yet assigned to any processor:

LK = {(a, b) | a ∈ MK, b ∈ M̄K, a communicates with b}

Step 3: For each a ∈ B, find the set La = {b | (b, a) ∈ LK}: the modules that were assigned to processor k by the partial mapping M and at the same time communicate with module a.

Step 4: For each processor p connected to processor k, calculate the total communication cost between module a, if placed on p, and the modules in La on processor k; denote this sum by CM(p, a). Define ACa as the minimum of CM(p, a) over all processors p connected to k; ACa is defined as infinite if no processor is connected to k. Define APa as the computation cost of executing module a ∈ B on processor k itself.

Step 5: Let Ia be the minimum of ACa and APa:

Ia = min(ACa, APa)    (9)

Step 6: The heuristic estimate h is then

h = Σ_{a∈B} Ia    (10)
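Steps 1-6 can be sketched as follows, again for the 2-processor test case with identical, fully connected links, so that CM(p, a) is the same for every neighbour p of the bottleneck processor k. The helper names and the dict-based mapping representation are illustrative assumptions, not the paper's notation.

```python
# Sketch of the heuristic h (equations (8)-(10)) for the 6-module,
# 2-processor test case with identical, fully connected links.
n, m = 6, 2
TP = [[10, 15, 5, 20, 15, 10],
      [12, 13, 7, 18, 13, 13]]
C = [[0, .5, 0, .5, 0, 0],
     [.5, 0, .8, 0, .2, 0],
     [0, .8, 0, .1, 0, .2],
     [.5, 0, .1, 0, 0, 0],
     [0, .2, 0, 0, 0, .3],
     [0, 0, .2, 0, .3, 0]]

def turnaround(assign):
    """TA_j = PT_j + CT_j for every processor j (equation (6))."""
    PT, CT = [0.0] * m, [0.0] * m
    for i, j in assign.items():
        PT[j] += TP[j][i]
        for s, k in assign.items():
            if k != j:
                CT[j] += C[i][s]
    return [PT[j] + CT[j] for j in range(m)]

def heuristic_h(assign):
    TA = turnaround(assign)
    k = max(range(m), key=lambda j: TA[j])         # step 1: bottleneck
    MK = [i for i, j in assign.items() if j == k]  # step 2: modules on k
    h = 0.0
    for a in (i for i in range(n) if i not in assign):
        La = [b for b in MK if C[a][b] > 0]        # step 3
        if not La:
            continue                               # a is not in set B
        AC = sum(C[a][b] for b in La)              # step 4 (links identical)
        AP = TP[k][a]                              # cost of a on k itself
        h += min(AC, AP)                           # steps 5-6: I_a
    return h

print(heuristic_h({0: 1, 1: 1}))
```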

The optimal task assignment is then found using the following steps:

Step 1: Put the initial node K = ∅ (the empty mapping) on a list called OPEN, and set f = 0, where f is the evaluation function defined in formula (1).
Step 2: Remove from OPEN the node n with the smallest f value and put it on a list called CLOSED.
Step 3: If n satisfies the goal state defined in the previous section, take the corresponding graph mapping M as the desired mapping. Otherwise, continue.
Step 4: Expand node n, using the operators applicable to n, and compute the value f = g + h for each successor n' of n, where g is computed using equation (7) and h using equation (10). Put all the successors of n on OPEN.
Step 5: Continue from Step 2.

As long as the value of h stays below the upper bound hmax, this algorithm always yields the optimal task assignment. If n task modules are mapped onto m processors, the algorithm has complexity O(n^2) for moderate sizes of n (typically n < 20), but the complexity approaches O(n^2 m^n) as n increases. Moreover, the size of the state-space, and hence the resources required by the A* algorithm, also increase substantially.

B. Parallel graph-matching algorithm

The basic methodology proposed by Shen and Tsai is a generate-and-test mechanism. Generating state-space nodes expands the state space and incurs computation time as successive nodes are generated and their associated costs are computed. The parallel graph-matching algorithm parallelizes the generate mechanism, thus dividing the state space into several smaller state-spaces. The basic steps are as follows. Let N be the number of parallel graph-matching tasks, let T = (VT, ET) represent the task graph, and let P = (VP, EP) represent the processor graph. Let Pi = (Vpi, Epi) be a subgraph of P used by the i-th task for mapping. The number of subgraphs of P is assumed to be equal to the number of parallel graph-matching tasks. Each graph-matching task follows steps 1 to 5 of the sequential algorithm, the only difference being that the node chosen for expansion is the one with the minimum value of f computed across all the parallel tasks. For this purpose, it is further assumed that the parallel tasks send each other the mapping state corresponding to the entry with the minimum value of f whenever a fixed number of new entries accumulate in the local mapping state space of each parallel task. This fixed number is called nodecount. Each parallel graph-matching task proceeds as follows:

Step 1: Put Ki = ∅ on a list OPEN and set f(Ki) = 0, where f is the evaluation function defined by equation (1). If Mglobal represents the global optimal mapping, i.e. the mapping with the smallest value of f found across all graph-matching tasks, initialize it to Ki.
Step 2: Set n = Mglobal.
Step 3: If n represents a state with no unmapped task and the smallest value of f among all OPEN nodes, or the number of new additions to the state-space equals nodecount, then send the local optimal mapping (Mlocal) to all parallel graph-matching tasks and wait for the others to send in their Mlocal. Find the mapping with the minimum f among the Mlocal and set it as Mglobal. If Mglobal has no unmapped tasks, this is the desired optimal mapping. Otherwise, set n = Mglobal and continue.
Step 4: Remove from OPEN the node n with the smallest f value and put it on a list called CLOSED.
Step 5: Expand node n, using the operators applicable to n, and compute f(n') = g(n') + h(n') for each successor n' of n. Note that the i-th graph-matching task generates successors by adding successive task modules only to the processors in the set Vpi. Go back to Step 3.

Figure 1. A representative task graph with 6 nodes

It is clear that the value of nodecount determines how frequently the parallel graph-matching tasks communicate. If this value is small, the tasks communicate often; this decreases the turnaround time of the resulting mapping and thus improves optimality, at the price of communication overhead. If it is too large, the search may fail to find the optimal solution, as many possibilities remain unexplored.

III. RESULTS AND DISCUSSION

A. Results using representative test cases

Two representative test cases are presented in Fig 1 and Fig 2 respectively; each figure represents a task graph. In Fig 1 and Fig 2, the vertices v_i, v_j represent task modules and the edge e_ij represents the connection between vertices v_i and v_j. The computation time of each module on the various processors is given by the matrix TP, and the number associated with each edge e_ij is the communication time of the data transfer along that edge. Both computation and communication times are in msec. In Fig 1, the number of nodes in the task graph is 6, which means that there are 6 modules, T = {0, 1, 2, ..., 5}, to be mapped. The computation times of these modules are given by the matrix TP, the i-th column of TP denoting the computation times of the i-th element of T. The inter-module communication is defined by the matrix C.

TP = | 10 15  5 20 15 10 |
     | 12 13  7 18 13 13 |

Figure 2. A representative task graph with 12 nodes

C = | 0   .5  0   .5  0   0  |
    | .5  0   .8  0   .2  0  |
    | 0   .8  0   .1  0   .2 |
    | .5  0   .1  0   0   0  |
    | 0   .2  0   0   0   .3 |
    | 0   0   .2  0   .3  0  |

The inter-processor communication time between two modules is assumed to be the same irrespective of the processors to which they map. In each of these cases, the basic mapping algorithm of Shen and Tsai is used. First, the heuristic function h(n) is taken to be 0 in the state-space based search; the test is then repeated with a non-zero h(n). Each case is compared with the corresponding sequential implementation. For each test case, node_count is varied and the results are shown in the following tables. An index called the optimality index is introduced to quantify the results:

optimality index = Optimal turnaround time (Sequential) / Optimal turnaround time (Parallel)    (11)

Similarly, a node-ratio index is defined as

node-ratio index = Number of nodes generated (Parallel) / Number of nodes generated (Sequential)    (12)
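As a worked example of the two indices, the sketch below uses illustrative numbers, not values taken from the tables:

```python
# Quality indices of the parallel implementation, equations (11)-(12).
def optimality_index(seq_opt_turnaround, par_opt_turnaround):
    return seq_opt_turnaround / par_opt_turnaround      # eq. (11)

def node_ratio(par_nodes_generated, seq_nodes_generated):
    return par_nodes_generated / seq_nodes_generated    # eq. (12)

# Illustrative values: a parallel run that is slightly less optimal
# but generates far fewer state-space nodes than the sequential run.
print(optimality_index(40.0, 50.0))   # 0.8
print(node_ratio(9, 36))              # 0.25
```

A value of 1 for the optimality index means the parallel run matched the sequential optimum, while a small node-ratio index means far fewer state-space entries were generated.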

Similarly, in Fig 2 the number of nodes in the task graph is 12, defined by T = {0, 1, 2, ..., 11}, which need to be mapped. The computation times of these modules are given by the matrix TP, and the inter-module communication is defined by the matrix C.

The matrix TP:
| 10  8 15  5 20 15 10 10 11  9 16 12 |
| 11  8  5  2  1  5 10  2  6  1  7 12 |

The test is then repeated with the test case represented by Fig 2 and the results are shown in Table II. Next, both test cases are repeated with non-zero h(n); the results are presented in Table III and Table IV respectively.

B. Discussions and Conclusions

The matrix C:
| 0   .5  0   .5  0   0   0   0   0   0   .3  0  |
| .5  0   .8  0   .2  0   0   0   0   0   0   .6 |
| 0   .8  0   .1  0   .2  0   0   .5  0   .4  0  |
| .5  0   .1  0   0   0   0   0   0   0   0   .6 |
| 0   .2  0   0   .3  .3  0   0   0   0   0   .6 |
| 0   0   .2  0   0   0   .6  0   .3  0   0   0  |
| 0   0   0   0   0   .6  0   .5  0   0   0   0  |
| 0   0   0   0   0   0   .5  0   .2  .4  0   0  |
| 0   0   0   0   .5  0   0   0   0   0   .3  0  |
| 0   0   .2  .4  0   .5  .5  0   0   .3  0   0  |
| .3  0   .4  0   0   0   0   0   0   .3  0   0  |
| 0   .6  0   .6  .6  0   0   0   0   0   0   0  |

To start with, the number of processors involved is assumed to be two. The mapping algorithm for two processors runs as two parallel matching tasks and, unless two modules are mapped to the same processor, the communication link speed between them is the same as that assumed for inter-processor communication.

The results presented in Tables I and III show the following:
1) As the value of node_count increases, the size of the search state-space reduces.
2) As the value of node_count is varied, the optimality index also varies. It is maximum at a certain value of the ratio

ratio = node_count / nos_nodes    (13)

where nos_nodes is the number of task modules. While the optimality index defines the quality of the solution reported by the parallel implementation relative to the actual optimal solution reported by the sequential implementation, the node-ratio index is the reciprocal of the efficiency of the parallel implementation in terms of the number of state-space entries generated to find the optimal solution.

Table I
RESULTS FOR A TASK GRAPH WITH 6 TASK MODULES, SETTING h(n) = 0

node_count | Optimal Mapping    | Turnaround Time (msec) | Nodes generated | Nodes (Sequential) | Optimality index
2          | 0A 1A 2A 3B 4B 5A | 51.90 | 28 | 36 | 0.76
3          | 0B 1B 2A 3A 4B 5A | 41.60 |  9 | 36 | 0.95
4          | 0B 1B 2B 3A 4A 5A | 46.00 |  6 | 36 | 0.86
6          | 0B 1B 2B 3B 4B 5A | 63.50 |  6 | 36 | 0.62

Table II
RESULTS FOR A TASK GRAPH WITH 12 TASK MODULES, SETTING h(n) = 0

node_count | Optimal Mapping                              | Turnaround Time (msec) | Nodes generated | Nodes (Sequential) | Optimality index
2          | 0A 1A 2B 3A 4B 5B 6B 7B 8B 9B 10B 11A | 61.60 | 702 | 732 | 0.83
3          | 0B 1B 2A 3A 4B 5A 6A 7A 8A 9A 10B 11B | 57.90 |  53 | 732 | 0.88
4          | 0B 1B 2B 3B 4A 5A 6A 7A 8A 9A 10A 11B | 65.40 |  27 | 732 | 0.78
6          | 0B 1B 2B 3B 4B 5A 6A 7A 8A 9A 10A 11A | 68.50 |  13 | 732 | 0.75
8          | 0B 1B 2B 3B 4B 5B 6B 7A 8A 9A 10A 11A | 88.80 |  12 | 732 | 0.58
10         | 0B 1B 2B 3B 4B 5B 6B 7B 8B 9A 10A 11A | 95.40 |  13 | 732 | 0.54

Table III
TEST CASE FOR TASK GRAPH OF FIG. 1 WITH 6 TASK MODULES WITH h(n) ≠ 0

node_count | Optimal Mapping    | Turnaround Time (msec) | Nodes generated | Nodes (Sequential) | Optimality index
2          | 0B 1B 2B 3A 4A 5B | 52.00 | 27 | 36 | 0.76
3          | 0B 1B 2A 3A 4B 5A | 41.60 |  9 | 36 | 0.95
4          | 0B 1B 2B 3A 4A 5A | 41.60 |  6 | 36 | 0.86
6          | 0B 1B 2B 3B 4B 5A | 65.50 |  6 | 36 | 0.60

From the results presented in the above tables, it is clear that a higher optimality index comes with a higher node-ratio index: to achieve a more optimal mapping, the parallel graph-matching tasks must communicate more often, so more nodes are generated and the efficiency decreases.

Figure 3. Variation of the optimality and node-ratio indices with the ratio node_count/nos_nodes, h(n) = 0, for test case 1

For each case, the variation of the two indices with the ratio node_count/nos_nodes is plotted, and the plots are shown in Fig 3, Fig 4, Fig 5 and Fig 6. Fig 3 and Fig 4 represent the test case of Fig 1 with h(n) = 0 and h(n) ≠ 0 respectively; similarly, Fig 5 and Fig 6 represent the test case of Fig 2 with h(n) = 0 and h(n) ≠ 0. The solid lines represent the optimality index and the dashed lines the node-ratio index.

Table IV
TEST CASE FOR TASK GRAPH OF FIG. 2 WITH 12 TASK MODULES WITH h(n) ≠ 0

node_count | Optimal Mapping                              | Turnaround Time (msec) | Nodes generated | Nodes (Sequential) | Optimality index
2          | 0A 1B 2B 3A 4A 5B 6B 7B 8B 9B 10B 11A | 60.90 | 564 | 552 | 0.84
3          | 0B 1B 2A 3A 4B 5A 6A 7A 8A 9A 10B 11B | 56.40 |  44 | 552 | 0.90
4          | 0A 1A 2A 3B 4B 5B 6A 7A 8A 9B 10B 11B | 63.01 |  23 | 552 | 0.81
6          | 0B 1B 2B 3B 4B 5A 6A 7A 8A 9A 10A 11A | 68.50 |  13 | 552 | 0.75
8          | 0B 1B 2B 3B 4B 5B 6B 7A 8A 9A 10A 11A | 88.80 |  12 | 552 | 0.58
10         | 0B 1B 2B 3B 4B 5B 6B 7B 8B 9A 10A 11A | 95.40 |  13 | 552 | 0.54

Figure 4. Variation of the optimality and node-ratio indices with the ratio node_count/nos_nodes, h(n) ≠ 0, for test case 1

Figure 5. Variation of the optimality and node-ratio indices with the ratio node_count/nos_nodes, h(n) = 0, for test case 2

Figure 6. Variation of the optimality and node-ratio indices with the ratio node_count/nos_nodes, h(n) ≠ 0, for test case 2

The plots indicate that the variation of the two indices with the ratio follows the same pattern in all four cases. From the plots it is seen that a mapping which is 90% optimal (optimality index of about 0.9) is obtained at a ratio of about 0.5 in all cases, and the corresponding node-ratio index lies between 0.1 and 0.3. This means that a 90% optimal solution is obtainable with roughly one-third of the effort of the sequential implementation.

It is further seen that with h(n) ≠ 0 the node-ratio index falls much faster and the optimality index rises much faster as the ratio is increased. This means that the heuristic search yields more optimal solutions and increases the efficiency of the parallel graph-matching algorithm. As node_count increases, the efficiency reaches a stable constant while the optimality keeps reducing.

IV. CONCLUSION

This paper establishes a methodology by which the original heuristic graph-matching algorithm can be parallelized in a typical asymmetric multi-processor environment so that large practical test cases can be handled. The parallel implementation uses a divide-and-conquer policy by which the size of the state-space, and hence the complexity, is reduced: each parallel task separately tries to map the tasks onto a selected subset of the processors. The methodology proves effective in cases where all processors are asymmetric in nature and the inter-processor communication links are identical.

V. FURTHER WORK

The following points are identified for further research:
- To investigate the multiprocessor (>2) and non-identical inter-processor link cases.
- To investigate the behavior of the parallel implementation on large test cases.
- To investigate the use of a heuristic bound to eliminate the expansion of non-promising nodes in the state-space.
- To study the shape of the plotted curves as node_count varies.

REFERENCES
[1] Gao, H., Schmidt, A., Gupta and Luksch, P., Load Balancing for Spatial-Grid-Based Parallel Numeric Simulation on Clusters of SMPs, Proceedings of the Euromicro PDP2003 Conference, February 5-7, 2003, Genoa, Italy, IEEE Computer Society Publications, pp. 75-82.
[2] Gao, H., Schmidt, A., Gupta and Luksch, P., A Graph-matching based Intra-node Load Balancing Methodology for Clusters of SMPs, Proceedings of the 7th World Multiconference on Systemics, Cybernetics and Informatics (SCI 2003), July 2003.
[3] Chien-Chung Shen and Wen-Hsiang Tsai, A Graph Matching Approach to Optimal Task Assignment in Distributed Computing Systems Using a Minimax Criterion, IEEE Transactions on Computers, vol. C-34, no. 3, March 1985.
[4] B.W. Kernighan and S. Lin, An Efficient Heuristic Procedure for Partitioning Graphs, Bell Systems Tech. J., 49 (1970), pp. 291-308.
[5] B.W. Kernighan and S. Lin, An Efficient Heuristic Procedure for Partitioning Graphs, Bell Systems Tech. J., 49 (1970), pp. 291-308.
[6] T.N. Bui, S. Chaudhuri, F.T. Leighton and M. Sipser, Graph Bisection Algorithms with Good Average Case Behavior, Combinatorica, 7 (1987), pp. 171-191.
[7] R.D. Williams, Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations, Concurrency: Practice and Experience, 3 (1991), pp. 457-481.
[8] C. Farhat, A Simple and Efficient Automatic FEM Domain Decomposer, Computers and Structures, 28 (1988), pp. 579-602.
[9] C.M. Fiduccia and R.M. Mattheyses, A Linear-Time Heuristic for Improving Network Partitions, ACM/IEEE Nineteenth Design Automation Conference Proceedings, 1982, pp. 175-181.
[10] P. Sadayappan, F. Ercal and J. Ramanujam, Cluster Partitioning Approach to Mapping Parallel Programs onto a Hypercube, Parallel Computing, 13 (1990), pp. 1-16.
[11] S.T. Barnard and H.D. Simon, A Parallel Implementation of Multilevel Recursive Spectral Bisection for Application to Adaptive Unstructured Meshes, in SIAM Proceedings Series, D.H. Bailey, P.E. Bjorstad, J.R. Gilbert, M.V. Mascagni, R.S. Schreiber, H.D. Simon, V.J. Torczon and J.T. Watson, eds., SIAM, Philadelphia, 1995, pp. 627-632.
[12] S.T. Barnard, PMRSB: Parallel Multilevel Recursive Spectral Bisection, Manuscript, 1996.
[13] R. Mohan and Amitav Gupta, Graph Matching Algorithm for Task Assignment Problem, International Journal of Computer Science, Engineering and Applications, vol. 1, no. 6, December 2011.
