
PARALLELISING SERIAL CODE: A COMPARISON OF THREE HIGH-PERFORMANCE PARALLEL PROGRAMMING METHODS

A thesis submitted to the University of Manchester for the degree of Master of Philosophy in the Faculty of Science and Engineering

January 1997

By Jon MacLaren
Department of Computer Science

Contents

Abstract
Declaration
Copyright
Acknowledgements

1 Introduction
   1.1 Motivation
   1.2 Aims
   1.3 Related Work
   1.4 Structure of the Thesis

2 Evaluation Criteria
   2.1 High Performance Computing
       2.1.1 Parallel Computing
   2.2 Maintainability of Code
   2.3 Performance
   2.4 Ease of Use
   2.5 Bias
   2.6 Summary

3 Methods to be Evaluated
   3.1 Beyond von Neumann
   3.2 Shared Memory Model
   3.3 Distributed Message Passing Model
       3.3.1 MPI - the Message Passing Interface
   3.4 Bulk Synchronous Parallel
   3.5 Target Architectures and their Compilers
       3.5.1 SGI Challenge Series
       3.5.2 Kendall Square Research KSR1
       3.5.3 Cray Research Incorporated T3D
   3.6 Summary

4 Developing Distributed Memory Model Code
   4.1 Issues with Distributed Memory Codes
   4.2 Overview of Method
   4.3 Serial Code to Threads and Barriers
       4.3.1 Parallelising Routines and Functions Separately
       4.3.2 Parallelising a Single Routine or Function Body
   4.4 Progression of Data Distributions
   4.5 Summary

5 Development of Codes
   5.1 Serial Code
   5.2 Devectorising for Coarse Grained Parallelism
   5.3 Shared Memory Code
   5.4 Bulk Synchronous Parallel Code
   5.5 Message Passing Interface Code
   5.6 Summary

6 Results
   6.1 Maintainability
       6.1.1 Readability
       6.1.2 Development Potential
       6.1.3 Portability
       6.1.4 Overview
   6.2 Performance
       6.2.1 Practical Issues
       6.2.2 SGI Challenge
       6.2.3 KSR1
       6.2.4 Cray T3D
       6.2.5 Overview
   6.3 Ease of Use
   6.4 Summary

7 Conclusions
   7.1 Discussion of Results
   7.2 Evaluation of Work
   7.3 Extensions and Further Work
   7.4 Summary

A Subroutine `dmdt'
   A.1 Original Version
   A.2 Devectorised Version
   A.3 Shared Memory Version
   A.4 BSP Version
   A.5 MPI Version

B Execution Times
   B.1 SGI Challenge
       B.1.1 Shared Memory
       B.1.2 Bulk Synchronous Processing
       B.1.3 Message Passing Interface
   B.2 KSR1
       B.2.1 Shared Memory
   B.3 Cray T3D
       B.3.1 Bulk Synchronous Processing
       B.3.2 Message Passing Interface

Bibliography

List of Tables
3.1 Overheads Associated with Barrier Synchronisation on the SGI
3.2 Overhead Associated with Using the `$DOACROSS' Directive on the SGI Challenge
3.3 Overhead Associated with Using the `tile' Directive on the KSR1
3.4 Overheads Associated with Barrier Synchronisation on the Cray T3D
4.1 Simple Code Transformations
4.2 Code Blocks used in Routine Bodies
6.1 Software Metrics for Final Codes
6.2 Execution Times for Serial Codes
6.3 Parallel Loop Overheads on the SGI Challenge
6.4 Overhead Analysis on SGI Challenge
6.5 Parallel Loop Overheads on the KSR1
6.6 Overhead Analysis on the KSR1
6.7 PMON Measurements and Execution Times for Codes on the KSR1
6.8 Page Miss Statistics for the Shared Memory Code on the KSR1
6.9 Development Cycle Diary

List of Figures
3.1 Von Neumann Architecture (I/O Omitted for Clarity)
3.2 Multiprocessor Model
3.3 Multicomputer Model
3.4 Simple Fortran Code with Parallel Directives for the SGI Challenge and KSR1 compilers
3.5 Bulk Synchronous Parallel Computer (Barrier Synchronisation Mechanism Omitted for Clarity)
4.1 Proposed Development Path for Transition to Distributed Memory Codes
4.2 Initial Alteration to Serial Code
4.3 Alterations to Subroutines and Functions
4.4 Decision Tree for Individual Code Statements
5.1 Hierarchy of N-Body Problems
5.2 Cell Hierarchy Used in Serial Code
5.3 Example Vectorised Code
5.4 Example Code with Loops Reversed and Fused
5.5 Example `Gist' Visualisation Tool Display
5.6 Fortran Loop with Iterations of Differing Costs
6.1 First Shared Memory Version of Example Code
6.2 First BSP Version of Example Code
6.3 First MPI Version of Example Code
6.4 Altered Shared Memory Version of Example Code
6.5 Altered BSP Version of Example Code
6.6 Altered MPI Version of Example Code
6.7 Efficiency Graphs for the Shared Memory Codes on the SGI Challenge
6.8 Efficiency Graphs for the BSP Code on the SGI Challenge
6.9 Efficiency Graphs for the MPI Code on the SGI Challenge
6.10 Comparative Efficiency Graphs for the SGI Challenge
6.11 Efficiency Graphs for the Shared Memory Codes on the KSR1
6.12 Efficiency Graphs for the BSP Codes on the Cray T3D
6.13 Efficiency Graphs for the MPI Codes on the Cray T3D
6.14 Comparative Efficiency Graphs for the Cray T3D
6.15 Efficiency of Shared Memory Code versus Development Time
6.16 Efficiency of Threads and Barriers Code versus Development Time
6.17 Efficiency of BSP Code versus Development Time
6.18 Efficiency of MPI Code versus Development Time

Abstract
Two hardware-based parallel programming methods, shared memory and message passing, are evaluated, together with one model-based method, Bulk Synchronous Programming (BSP), in respect of their suitability for parallelising existing serial codes. The evaluation is based on the codes produced by parallelising one particular serial code, an N-body micromagnetics program for simulating the magnetisation of thin film media, under each method. Although the performance of the produced codes is important, the maintainability of the codes is posited to be more important. In order to compare maintainability, the aspects of readability, development potential for a non-expert, and portability are considered for each code. During the development cycle, the time spent programming each of the three codes is recorded, and the efficiency of each code is measured at regular intervals. This allows efficiency to be plotted against development time, showing how long each method takes to yield an efficient code.

The results show that the shared memory method, programmed using parallelising compiler directives, is the most appropriate method for parallelising existing codes. Shared memory is shown to yield more maintainable codes than either message passing or BSP, despite the seemingly superior portability of codes developed under the latter.

During the evaluation of the methods, an incremental development method for producing distributed memory model codes is proposed and explained. This method provides an evolutionary way of producing such codes from serial codes, a task which has previously been performed using revolutionary methods. The automation of this development method is identified as a potential area for future work.

Declaration
No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institution of learning.


Copyright
Copyright in text of this thesis rests with the Author. Copies (by any process) either in full, or of extracts, may be made only in accordance with instructions given by the Author and lodged in the John Rylands University Library of Manchester. Details may be obtained from the Librarian. This page must form part of any such copies made. Further copies (by any process) of copies made in accordance with such instructions may not be made without the permission (in writing) of the Author.

The ownership of any intellectual property rights which may be described in this thesis is vested in the University of Manchester, subject to any prior agreement to the contrary, and may not be made available for use by third parties without the written permission of the University, which will prescribe the terms and conditions of any such agreement.

Further information on the conditions under which disclosures and exploitation may take place is available from the head of Department of Computer Science.


Acknowledgements
Firstly I must thank Professor John Gurd, my supervisor, for reading parts of this thesis at various stages of development; my grasp of writing English has improved considerably. A strong vote of thanks also goes to Mark Bull who suggested the incremental development method, and who put up with my endless questions while I was learning Fortran. Thanks also to Graham Riley who has provided many a useful reference, and, along with Jim Miles, the basis for this work.

On a more personal note, thanks to Rizos, who also gave up some of his time to read this thesis, and over the year has provided many an interesting conversation. Still on conversations, the daily tea breaks with the members of the Centre for Novel Computing (usually Graham, Elena, Mark, Rupert, Dave and Andy) have supplied many discussions, some informative, some just bizarre.

A huge thank-you goes to my parents, without whom this, and the forthcoming study for a PhD, would, quite literally, not have been possible. Last, but by no means least, thanks to all the friends and family who have kept me going, especially Rick and Uncle Bob for their helpful advice and nights out.


Chapter 1 Introduction
The `parallelisation' of existing serial codes is an important task in the field of High Performance Computing (HPC). Despite three decades of research and commercial development in this area, it is not clear which of the many available parallel programming methods is most suitable for this task. Three parallel programming methods are selected and compared as to their suitability for parallelising existing codes. In order to evaluate the three methods, the same serial code is parallelised using each method. Firstly, the motivations for such an evaluation are given in full. Next, the aims of the thesis are explained. Related works are discussed, and the chapter concludes with an outline of the structure of the rest of the thesis.

1.1 Motivation
In many cases, old codes don't die. The cost of completely re-engineering a code may prove to be prohibitive. Either that, or it may be the case that the original author is not available, and so there are parts of the code which no-one understands any more. For whatever reasons, many people, in both industry and academia, are still using codes which may be ten years old, or even more.

This situation is only acceptable until either the hardware being used to run the code becomes too expensive (or impossible) to maintain, or other pressures mean that the code must run in less time than before. Provided the language in which the code is written is implemented on new architectures then, for any sizeable code, porting the code should be far cheaper than re-writing it.

Unfortunately, a new architecture does not bring an automatic performance improvement. The old code was most probably highly tuned to the old platform. Older compilers could perform far fewer optimisations than today's compilers, so the older the code, the more tricks that will have been employed to squeeze out the last few drops of performance from the hardware. Readability is often sacrificed. Perhaps the most extreme case is that of programs written or adapted for vector processors. Older compilers for these machines could not vectorise loops with many lines of code, and so these were not used if possible, being replaced instead by numerous loops of few lines of code, avoiding conditional statements within the loops wherever possible. This process of loop splitting makes it harder to tell what the code is doing.

For the code to be tuned for the new architecture, it would often be preferable to return to an untuned, serial version of the code. But, even in cases where such a version exists, it is unlikely to have been kept up to date over the years, as modifications and bug-fixes were made to the tuned version of the code.

Finding a new architecture which is capable of delivering comparable performance to the old platform may not be easy. This may not be a problem for many, but for highly tuned codes running on large vector or array processors this may be a big issue. For this reason, many people from these communities have looked to parallel architectures to find the additional performance they require. Unfortunately, despite over thirty years of research and commercial development in the area of parallel computing, it is still not clear which method of parallel programming is most suitable for the task of parallelising existing codes.
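
To make the loop-splitting point concrete, the fragment below shows the kind of transformation involved. It is a hand-written sketch with invented array names and loop bounds, not an excerpt from the code studied in this thesis.

c     A loop as its author might naturally write it
      do 10 i=1,n
        t(i)=a(i)*b(i)+c(i)
        d(i)=t(i)*t(i)
        e(i)=d(i)-a(i)
   10 continue

c     The same work split into several short loops so that an early
c     vectorising compiler will accept them; the overall intent of
c     the calculation is now harder to see
      do 20 i=1,n
        t(i)=a(i)*b(i)+c(i)
   20 continue
      do 30 i=1,n
        d(i)=t(i)*t(i)
   30 continue
      do 40 i=1,n
        e(i)=d(i)-a(i)
   40 continue

Multiplied across a large code, transformations of this kind are exactly what makes old, highly tuned sources difficult to read and to retarget.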

1.2 Aims
Three parallel programming methods will be evaluated, based on qualities of the parallel codes which they produce. Criteria will be chosen which can be used to judge the suitability of such methods for the task of parallelising existing codes. Next, three different methods that can be used to parallelise serial codes will be selected and judged according to the chosen criteria. The selection of three methods will allow very different methods to be considered, while still permitting an in-depth study.

The results of these evaluations will be important, as this is one of the first comparisons of its kind, and should be useful to anyone who is considering parallelising a serial code. Although the three methods will only be judged on the basis of one code (due to the time frame and the study being all the author's work), the conclusions for this code should be valid for some similar codes. More importantly, it is hoped that the method used, including the criteria chosen, can be recognised as both valid and useful for people wishing to perform similar comparisons for different classes of codes, or perhaps comparisons of different methods. Hopefully, this work may even provide the impetus for a larger study to be performed (see Section 7.3 for discussion).

The shared memory programming method will be shown to be the most suitable method for parallelising existing codes, being both easier to use than distributed memory models, and generating more maintainable code. It will also be shown that these advantages are not gained at the expense of performance.


1.3 Related Work


Comparisons of parallel programming methods are rare, and typically compare methods solely on issues of execution times and speedup, e.g. [Levelt92], which compares two different methods for distributed shared memory, one of which is virtual shared memory. Far more common in the literature are papers detailing the solving of a problem where a parallel implementation was used, with perhaps a comparison of results for different architectures. Examples of this type of paper can be found in [Ford95] for shared memory, [Zhang94] for message passing, and [Hill96] for BSP.

Recent N-body literature contains several works examining parallel implementations, e.g. [Holt95] and [Board95]. A more recent, and very thorough, exploration of implementing efficient N-body solvers in parallel languages can be found in [Hu96].

[Fenton91] is a general work on the subject of Software Metrics. More specific works on the use of metrics as a tool for helping software engineering project management can be found in [Grady92] and [Stark94]. As stated in Section 2.2, based on [Lanning94], metrics may be used to assess the readability of a code. More qualitative treatments of readability can be found in the area of Programming Psychology [Sheil81], under the more specific topic of program understanding or comprehension, e.g. [Brooks83].

1.4 Structure of the Thesis


The remainder of this thesis is organised into five chapters, and one appendix. Chapter 2 briefly introduces the field of High Performance Computing, then discusses the criteria which will be used to evaluate the three parallelising methods. The methods themselves, and the reasons why they were chosen, are introduced in Chapter 3. Chapter 4 presents a novel development method for gradually evolving distributed memory model codes from serial codes. Chapter 5 introduces the serial code which will be parallelised. The rest of the chapter deals with the development of the three parallel codes. Results are presented in Chapter 6, and a discussion of these forms part of Chapter 7, the conclusions chapter, which also evaluates the work carried out and discusses extensions and possible future work. Finally, the source for one of the subroutines in the program is presented in Appendix A, in its original form, plus one intermediate version and the three final versions.

Chapter 2 Evaluation Criteria


Following the identification of the goals of the thesis, the criteria for evaluating the parallelising methods can now be discussed. To give some context to the thesis, the field of High Performance Computing will be introduced, and the area of Parallel Computing identified within this field. So that suitable criteria may be chosen, the basic goals of the field will be introduced and discussed briefly. Bearing in mind the problems with legacy codes identified earlier, the criteria which will be used to compare the three parallelising methods will be selected and explored. In particular, the ways in which to measure the criteria will be considered, and the relative importance of the criteria will also be discussed.

2.1 High Performance Computing


High Performance Computing (or HPC) is concerned with problems where the von Neumann architecture (defined in [Burks46]) is, for one reason or another, inadequate. Sometimes the inadequacy is to do with performance: there is a scientific or numerical code which must be run faster or on larger problems. Other times it may be a matter of issues such as expressibility: a machine is required which bridges the semantic gap, which has hardware that is closely related to the way in which it is to be programmed, e.g. ALICE, a graph reduction machine for executing functional programs (described in [Cripps87]). To overcome these inadequacies, the traditional von Neumann architecture may be used as a starting point from which to improve (e.g. pipelining, vector processors), or it may be discarded in favour of something completely different (e.g. graph reduction machines, and functional programming machines).

Although the field of HPC has been around for over thirty years now, it is still not clear how best to improve upon the von Neumann architecture. Since the beginnings of HPC, different techniques have dominated at different times. Often these techniques were based upon the technologies which were available at the time. This meant that HPC was often architecture-driven, a good example of this being vector processors, which were popular for a long time.

The idea to use more than one processor simultaneously to solve a single problem is not a new one. Some early examples of highly parallel architectures are the early array processors, such as the pioneering SOLOMON (described in [Slotnick62]), which contained a two-dimensional array of 32 × 32 processors, each with a local memory of 128 32-bit numbers, under the control of a single instruction stream from a central control unit. This architecture was the basis for future key architectures such as Illiac IV [Slotnick67], although von Neumann's cellular automata [Neumann51] could be considered to be a form of parallel machine.

Today, however, parallelism can be found to a certain degree in all but the simplest scalar machines. Given that a pipeline is a form of parallel computation, it can be seen that even the humble home computer performs some parallel processing: the Intel Pentium chip [Intel95] has a multi-staged floating-point pipeline. From this it becomes clear that parallel processing is a fundamental part of modern computing, not just a niche subject area.

Indeed, the amount of parallel processing in systems can be seen as a grey-scale: machines with simple pipelines at one end (just after truly scalar machines), followed by vector machines, through machines containing co-processors, with machines containing thousands of main processors at the other end. It is machines with more than one main processor, i.e. machines towards the latter end of this scale, which are usually being referred to when discussing Parallel Computing.

2.1.1 Parallel Computing


Parallel Computing is the dominant sub-area of High Performance Computing. The primary motivation for Parallel Computing is performance. Consider, for example, a weather simulation which must be run in a quarter of the time which it currently takes; or else, the simulation must be made more accurate by dividing up the area of simulation into smaller chunks, but run just as quickly. The goal of Parallel Computing is to use more than one processor (and hopefully many processors) efficiently to solve a given problem. In the example given, four processors should ideally be able to run the code in a quarter of the time that one (similar) processor can, although this is hard to achieve in all but a few trivial EPPs (Embarrassingly Parallel Problems). Performance issues are discussed in more detail in Section 2.3.

2.2 Maintainability of Code


Code maintainability is an issue which is often ignored for the sake of improved performance. Many find it quite acceptable to contort code, often hiding its original meaning, even for a relatively small gain in speed. Perhaps the vector processor community is the best, or rather worst, example of this. Vector processors, the first of which was the CRAY-1 [Russell78], have instructions for manipulating ordered sets of numbers. To use these machines effectively, without resorting to writing assembly language, a compiler capable of spotting opportunities for using these instructions is required. Until recently, compilers for vector machines could only recognise opportunities for vectorisation in loops consisting of just a few lines of code. This led to large loops being split up into smaller ones, sometimes looping over different array indices than might be expected.

Another problem, which is common to both vector machines and pipelined machines, is that of branching. On a pipelined machine, any instructions which have been prefetched from the code after the branch point must be discarded if the branch is taken. This is costly, but not nearly as expensive as on vector machines. Vector machines rely on the same operations being carried out for many consecutive elements of an array, but if there is a conditional branch inside such a loop, the compiler will not be able to generate vector instructions to execute the loop. This places pressure on the user to remove conditionals from inside the loop, sometimes involving a change in algorithm, perhaps utilising some `trick' to achieve this; all of which conspires to hide the original meaning of the code from the user.
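
As a small illustration of such a `trick' (again an invented fragment, not taken from the micromagnetics code): clipping negative values to zero can be written with a conditional, or rewritten without one using the identity max(x, 0) = (x + |x|)/2, which vectorises but obscures the intent.

c     Conditional version: clear, but blocks vectorisation on many
c     early vector compilers
      do 10 i=1,n
        y(i)=a(i)-b(i)
        if (y(i).lt.0.0d0) y(i)=0.0d0
   10 continue

c     Branch-free version using max(x,0) = (x + |x|)/2: it
c     vectorises, but the clipping intent is hidden
      do 20 i=1,n
        y(i)=0.5d0*((a(i)-b(i))+dabs(a(i)-b(i)))
   20 continue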

Both internal documentation (i.e. comments in the code) and external documentation are important facets of maintainability, but for a program to be truly maintainable, one software engineer must be able to pick up where another left off with little trouble. So, without readable code, internal and external documentation are of little use to someone who has never seen the code before. Another issue is that codes will often be maintained by their users, who may not be software engineers; e.g. research codes will often be looked after by the scientists who are using the code. In these cases, for a program to be maintainable, people who are not experts in the architecture on which the code runs must be able to modify the program successfully.

Another factor affecting code maintenance is portability. The necessity of being able to port a code has been mentioned in Section 1.1, but it is worth mentioning again as an aspect of maintainability. Some methods of parallelising will yield instant portability to a wide variety of platforms, e.g. the Parallel Virtual Machine (PVM) message passing library [Geist93]. Although this guarantees the future of the code, it does not automatically provide full maintainability, as important features, such as readability of code, are not ensured.

In the light of the foregoing, maintainability will be judged by considering the following three important questions:

Readability: How easy is the code to understand?

Development Potential: How easy will the code be to modify/debug by a non-expert?

Portability: How easy will it be to port the code?


These factors essentially govern the maintainability of the code: if the code is intuitive, easily modifiable by a non-expert, and readily portable, then the code has a future; it is maintainable. Clearly the above elements are all interrelated, e.g. the code will be easier to modify if it is readable. Nonetheless, they are sufficiently distinct to be considered separately.

To evaluate each method for maintainability (i.e. how maintainable the code parallelised by that method is), the final codes can be compared not only to each other, but also to the serial version provided. This is itself an important point, as maintainability is not just an end result, but an ongoing, ever-changing property of the code. The initial maintainability of the serial code is also important, as a lack of readability may hinder the process of parallelising the code. In order to exploit fully the chosen parallelising method, it may be necessary to spend some time untangling the optimisations which have been performed. This process is a form of reverse engineering, and will certainly affect the development times for each code. Although the initial maintainability of the serial code cannot feature in the evaluation of maintainability of the final codes, it will, amongst other things, have an impact under the `Ease of Use' criterion (Section 2.4).

Clearly all three of the above elements of maintainability are hard to quantify, as they are all inherently subjective. The qualities of development potential and portability will only be discussed in relative terms; however, ideas from the area of Software Metrics may be used to quantify readability. Issues concerned with measuring and comparing each of the three elements will now be discussed.

Software metrics attempt to quantify properties of software which are usually regarded as purely qualitative. The measurement of specific attributes of a code may be used to monitor the development cycle, thus enabling a developer to learn more about, and potentially improve, their development process, as in [Grady92]. The evaluation of a finished code is required here, rather than a continuous monitoring, but the principles are the same. In [Lanning94], source code complexity is related to `maintenance difficulty'. The number of executable statements, the number of operators and the number of operands are demonstrated to be related to maintenance difficulty. This is intuitive, as the larger the program is, the harder it will be to read. To someone who knows the serial code well, the number of lines which are different in the parallel codes will also be important: the more lines changed, the less readable the code will be to them.

It is important that the measuring process is automated, so that it does not become too time consuming. For this reason, rather than measuring the number of operators and operands, the number of printable (i.e. non-whitespace) characters will be measured for each of the codes. As stated earlier, the number of lines changed between the serial code and each parallel code will also be measured. Another important measure of change in a code is the number of data structures which are added or modified.

The addition of a data structure to a code is significant because it represents a modification to the algorithm of the code. The modification of a data structure may also represent algorithmic change, but often modifications are due to a change in notation, e.g. a reordering of the indices of an array. The same argument applies, that the more a code changes, the less readable it becomes to someone who knew the serial code well. For this reason, the number of new data structures will be counted, as will the number of data structures which are altered. To avoid confusion, the reason for data structure alterations will be noted in each case.

To demonstrate Development Potential, it may be necessary to resort to examples, as the completed parallel codes will require no further modifications. While it would be possible to consider modifications to an area of this code, it is likely that the examination of a smaller example code would be preferable.

When comparing the portability of different codes, it is important to consider why one code is more portable than another. For example, if one library-based system was found to yield less portable codes than another comparable system, was this due to inherent properties of the library, or simply a lack of time spent porting the library to different platforms? It is important that a code is not judged to be less portable than another simply because of factors which may change with time; codes' portabilities should be compared based on unchanging advantages and disadvantages. Considering the reasons for a code's level of portability will help ensure that the codes are not assessed unfairly.

2.3 Performance
The driving force behind Parallel Computing is performance: trying to run programs as fast as possible, by using more than one processor. Clearly execution time is too crude a measurement alone; it does not take into account how well resources are being used, and whether or not this changes with the number of processors used.

As stated in Section 2.1.1, the goal of Parallel Computing is to use more than one processor efficiently to solve a given problem. A code's efficiency can be measured, for a fixed problem size and on a given number of processors, by comparing the execution time to that of the fastest serial version of the code.

For a given problem and problem size, let the execution time for the fastest serial version of the code be $T_{seq}$. Ideally, the parallel code would be able to solve the same problem in half the time on two processors, a third of the time on three processors, and so on. So for $p$ processors, the ideal execution time, $T_{ideal}(p)$, is given by:

$$T_{ideal}(p) = \frac{T_{seq}}{p}.$$

Now let the measured execution time for the parallel code on $p$ processors be $T_{par}(p)$. Then the efficiency on $p$ processors, $E(p)$, can be defined as:

$$E(p) = \frac{T_{ideal}(p)}{T_{par}(p)} \qquad (2.1)$$
$$E(p) = \frac{T_{seq}}{p\,T_{par}(p)} \qquad (2.2)$$

Clearly, for any given problem size and number of processors, the closer a program's actual execution time is to the ideal execution time, the more efficient that program is. Another common term used when discussing parallel codes is speedup, which, for a given number of processors, is defined as how many times faster than the best serial code the parallel code runs. Clearly, the ideal situation will yield linear speedup, giving:

$$S_{ideal}(p) = p. \qquad (2.3)$$

Now define actual speedup, $S_{par}(p)$, as:

$$S_{par}(p) = \frac{T_{seq}}{T_{par}(p)}, \qquad (2.4)$$

where $T_{seq}$ and $T_{par}(p)$ are the best sequential and the actual $p$-parallel execution times, as defined earlier.
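
As a worked illustration of these definitions (the numbers are invented for the example, not measured results from this study): suppose the best serial code takes $T_{seq} = 100$ seconds and a parallel version runs in $T_{par}(4) = 30$ seconds on $p = 4$ processors. Then

$$S_{par}(4) = \frac{100}{30} \approx 3.3, \qquad E(4) = \frac{100}{4 \times 30} \approx 0.83,$$

i.e. the code achieves about 83% of the ideal linear speedup on four processors.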


There are occasions on which super-linear speedup (sometimes called speedup anomaly) may be obtained. As is clear from the equations above, this is equivalent to the situation where $E(p) > 1$. This is typically due to the increased amounts of cache which are available when using more than one processor.

Clearly, for any given problem size, there will be a limit to the number of processors which can be used efficiently to solve the problem. Indeed, if a problem is sufficiently small, it may be that the program cannot operate efficiently above three or four processors. However, with a sufficiently large problem size, it should be possible to use many processors efficiently.

The plotting of graphs of efficiency versus <numbers of processors> will enable the performance of each code to be judged. Of particular interest will be how well the codes perform with large numbers of processors, and whether the codes achieve higher efficiencies with larger problem sizes.

2.4 Ease of Use


In any area of programming, development costs are a major issue. Even in the field of High Performance Computing, where performance is taken to be of paramount importance, development time is of considerable importance. Most parallel programming methods are incremental in their approach, and, given the investment of more time, will yield a more efficient code. Typically, later steps will take longer and yield smaller improvements, with efficiency behaving asymptotically. At some stage in the development process, it may be preferable to halt development if the increases in efficiency drop below a certain criterion, e.g. 1% per engineer-day, sacrificing any future gains in efficiency in order to save on development costs.

If two programming methods take different development times to produce codes which are comparably efficient, then the method requiring less development time can be said to be easier to use than the other, as it has achieved the same goal in less time. Comparing methods in this way may be complicated if codes do not achieve the same level of efficiency. For example, if one parallelising method produced a code with a typical efficiency of 0.5 in only 1 engineer-hour, but it took a second method 4 engineer-hours to achieve the same level of efficiency, then the first method would seem to be easier to use. If, however, the second method could go on to yield efficiencies of up to 0.9, after a further 4 engineer-hours of effort, but the first method could make no further improvement, then clearly the second method is preferable.

However, in this study, an existing code is to be parallelised, which will be done without any algorithmic changes. This should mean that each parallel code will eventually yield a similar level of efficiency. In the event that there is a large difference in the final efficiencies of the codes produced by the three methods, one method will only be considered to be easier to use than another if it takes less development time and it yields at least similar efficiency levels (within 5%).

As stated earlier, the way efficiency changes during development is interesting, and so the efficiency of the codes will be measured during the development cycle, and graphs of efficiency versus development time will be plotted for different <problem size, number of processors> pairs. This will allow the consideration of more than just the total development time, e.g. the time up until the point when it would have been best to halt development, when considering which method is easiest to use.

1. An example of parallelising methods which yield quick results is parallelising compilers, which, at the time of writing, yield substantially less efficient code than manual methods.


2.5 Bias
In order not to bias the results towards a particular number of processors, or to particular problem sizes, efficiency will be plotted for a representative set of <problem size, number of processors> pairs.

When measuring ease of use, simply measuring the amount of time spent developing each method may be deceptive. The main reason for this is that time will be spent performing tasks which help all of the methods, such as getting to know the serial code, untangling hardware-specific optimisations (as discussed in Section 2.2), etc. This can be easily dealt with by keeping a diary of activities and associated times during the development cycle. This way, tasks can be marked as contributing to more than one method: as each method is evaluated, the diary should be reviewed, and earlier tasks which are being relied upon marked appropriately.

More of a problem is the fact that knowledge of the code will inherently increase as time goes by, thus favouring methods which are evaluated later. This is unavoidable without using more than one person to evaluate the methods, and will be reasoned about when examining the results (see Section 6.3 and Section 7.3).

2.6 Summary
The continued dependence of both academia and industry upon legacy codes was stated as a motivation for this thesis. As codes are to be relied upon for so long, their maintainability is of vital importance, which was reflected by the choice of maintainability as the first, and most important, criterion. This criterion was divided into three areas: readability, which will be measured with the help of software metrics; development potential, or how easy the code is to modify by a non-expert; and portability, which will compare the number of platforms on which the parallel codes can be run, and explore the reasons for any differences.

Performance is often seen as being of paramount importance in High Performance Computing, and so was chosen as the second criterion. Execution times will be measured and compared to ideal times based on a sequential version of the code, enabling the code's efficiency to be calculated.

The third criterion chosen was `ease of use'. This criterion will attempt to measure the length of time taken by a particular method to produce an efficient code. As the parallelisations are all to be carried out by the same person, issues of bias were recognised as being important, and were discussed in depth.

The selection of these criteria will enable the methods of parallelisation to be compared with respect to their suitability for parallelising legacy codes. With this part of the work done, the next step is to introduce the parallelising methods themselves.

Chapter 3 Methods to be Evaluated


With the evaluation criteria fully described, the parallelising methods which are to be evaluated can now be presented. To start with, the von Neumann architecture will be examined, considering in particular possible parallel extensions to the model. Suitable ways of programming these parallel architectures will then be considered as possible candidates for parallelising methods. As well as considering architectures which have been motivated by technology, the more abstract approach of bridging models will be examined. Although some bridging models may be unsuitable on the grounds of performance, this area will be used as the second source of possible parallelising methods. These two sources will provide the three parallelising methods, which will then be discussed in more detail, and ways of implementing them will be considered. Target architectures for the parallelising methods will be introduced, and the last section of this chapter will give technical details of the machines used, including timings relevant to the parallel methods.


3.1 Beyond von Neumann


As stated in Section 2.1.1, the goal of Parallel Computing is to use more than one processor efficiently to solve a problem. Given this goal, how can it be achieved? The processors must be connected together in some way that allows them to co-operate. To commence this analysis, the von Neumann architecture will be considered. As shown in Figure 3.1, in its simplest form the architecture consists simply of a processing element and a word-at-a-time random access memory unit, connected via a simple bus. In reality, input/output (I/O) devices would also be connected to this bus, but these are omitted for the sake of clarity. If the architecture is to be extended to have more than one processor, this can be done in one of two ways:

1. more processors can simply be added to the bus, or
2. the bus can be extended to join several von Neumann machines together.

For a many-processor architecture, a simple bus may well be inadequate. So, replacing the bus by a more general `interconnection network', these two extensions now give us the multiprocessor (Figure 3.2), and the multicomputer (Figure 3.3). Example hardware implementations of the multiprocessor are the Cray C90 and the SGI Challenge; examples of the multicomputer are the Cray T3D and IBM SP/2.

The key difference between the two models is in their memory organisation. The multiprocessor has a single continuous address space (centralised memory) which all processing elements can access whereas, in the multicomputer, each processing element has its own local memory (distributed memory).

1. Note that the use of cache, which is present to some extent in all modern computers, blurs the distinction between the two models, as cache is a form of distributed memory.

Figure 3.1: Von Neumann Architecture (I/O Omitted for Clarity)

Figure 3.2: Multiprocessor Model

Figure 3.3: Multicomputer Model


This gives us three different kinds of memory accesses: in the multiprocessor, all accesses are to the same global memory and will be called MP-global; in the case of the multicomputer, there will be accesses to memory locations in the processor's own, local, memory, which will be called MC-local, and also accesses to other processors' memory locations, which will be called MC-remote. The most common technique for handling MC-remote accesses is to leave it up to the programmer to deal with by explicitly passing messages containing data from one processor to another. This technique is known simply as Message Passing.

The two models are equivalent, in that they can emulate each other. This is easy to see in the case of the multiprocessor emulating the multicomputer: each processor can be made to access disjoint memory locations, and message passing can be simulated by simply copying data from an output buffer in one processor's locations to an input buffer in another's. For a multicomputer to emulate a multiprocessor is harder: since all processors need to be able to access all of the memory locations, an underlying system is needed which can resolve an access to data which is located in another processor's local memory. Such a system is known as Virtual Shared Memory (VSM) and may be implemented in software, hardware, or a combination of both. A good overview of issues concerned with the implementation of shared memory on a distributed memory machine is provided in [Murray93], which also recognises the difficulties with message passing arising from the need for explicit communications.

The techniques of Shared Memory (true or virtual) and Message Passing continue to dominate research into Parallel Computing, and indeed High Performance Computing as a whole. But, although both systems are targeted at the same hardware, and both systems have been around for at least two decades, it is still not clear which system is best. Indeed, during the 1980s, as European research was shifting away from Message Passing towards Shared Memory, research in the United States was moving in completely the opposite direction. Because these methods are so dominant, Shared Memory and Message Passing will be two of the methods which are evaluated.

As was stated earlier, research into parallel computing has been ongoing for decades, but as yet parallel computing has not become commonplace; it is still a specialist area. One suggested reason for this is that current methods of programming parallel architectures are too semantically distant from the goals of their users, so it is difficult for users to express their problems on parallel machines. The proposed solution for this is that architectures should be based not on technology, but on more abstract ideas: that is, models should be constructed which provide the desired programming style, and these models then implemented in hardware. In other words, architectures should be designed to provide a bridge between hardware and software. In this way, the von Neumann model is a bridging model for sequential computing. It allows expressive languages to be designed which can be efficiently compiled for the model, and the model itself can be efficiently realised in hardware. This has led to the ability to develop sequential programs which can be compiled and run efficiently on many different kinds of sequential computers.

Several bridging models have been proposed for parallel computing, most notably the PRAM model [Fortune78] and, more recently, the LogP model [Culler93]. Bulk Synchronous Parallel (BSP) [Valiant90] is a bridging model which was designed slightly earlier than LogP, but which is receiving renewed attention in the research community. This will be the third and final method of parallelisation which will be evaluated.

There are several, more radical bridging models, such as functional programming, which were considered as alternatives.

Functional programming is rarely explicitly parallel [Peyton-Jones96a], but the nature of languages such as Haskell [Hudak92], which often use lazy evaluation, does not rule out the possibility of implicit parallelism. Despite recent compiler advances for these languages [Peyton-Jones96b], they still generate far less efficient code than imperative languages, such as Fortran. This is most likely to be due to the gap between the bridging model and the hardware being too large, resulting in functional languages being hard to implement. As performance is important in this study, no such radical models will be used.

With the three methods of parallelisation chosen, a common language is required with which to work. The three most popular languages for work in parallel programming are C [Kernighan78], C++ [Stroustrup94], and Fortran 77 [ANSI78]. Most libraries designed for use in this area (e.g. PVM [Geist93], MPI [MPIForum94]) provide interfaces to these languages, and on shared memory machines these languages often have compiler support for parallelism provided. However, this study is about parallelising legacy codes, so Fortran 77 will be used, as due to the age of the language (Fortran 1 dates back to 1956) there will be vastly more legacy codes which are written in Fortran than in C.

The three chosen methods of parallelisation will now be examined in greater detail.

3.2 Shared Memory Model


The Shared Memory model has the advantage that a single consistent memory map, like that of a sequential machine, is presented to programmers. This makes the model easy to understand for anyone who has programmed a serial machine beforehand, which will be nearly everyone. But the question remains, how to program such a machine? And, if there are different ways, then which is the best?

The ways in which such a machine can be programmed really depend on what support for parallelism the manufacturers supply. There will clearly be system functions to perform basic tasks. At least the following basic functionality will be present.

There will be a mechanism for starting more than one thread of execution for the same program. This may be done after the program is running, by calling a system function from the first process (e.g. Silicon Graphics IRIX Operating System), or alternatively, all the required processes may be started right at the beginning (e.g. Cray UNICOS MAX Operating System). There will also be a mechanism to synchronise the threads of execution: some kind of barrier routine which none of the threads leave until all threads have called it. A locking mechanism will also be required in order to protect accesses from different threads to shared data from interfering with each other; these are a form of semaphore.

However, these functions are at a low level; the last two may even be implemented directly in hardware. For people who are used to programming in languages such as C and C++, this may be acceptable; indeed programming in this way does yield a high level of control. But, for those used to programming in high level languages, such as Fortran, higher level abstractions are required if they are to program the machine without the risk of creating race conditions, or other low level timing bugs.
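
To make this low-level style concrete, the sketch below parallelises a simple array product with a maximum-magnitude reduction using such primitives. The routine names (start_threads, thread_id, barrier, set_lock, unset_lock) are invented for illustration; real systems supply equivalents under vendor-specific names.

c     Hypothetical threads-and-barriers sketch; the thread and lock
c     routines are invented names standing in for system functions
      integer i, me, nthr
      double precision a,b,c,cmax
      dimension a(16),b(16),c(16)
      nthr=4
c     Start the extra threads; all threads execute the code below
      call start_threads(nthr)
      me=thread_id()
      if (me.eq.0) cmax=0.0d0
      call barrier()
c     Each thread handles a contiguous block of the iterations
      do 100 i=me*(16/nthr)+1,(me+1)*(16/nthr)
        c(i)=a(i)*b(i)
c       Updates to the shared variable cmax must be protected
        call set_lock()
        if (dabs(c(i)).gt.dabs(cmax)) cmax=c(i)
        call unset_lock()
  100 continue
      call barrier()

Even in this small fragment the programmer must compute the iteration bounds by hand and guard the shared reduction variable explicitly; forgetting either step introduces exactly the race conditions mentioned above.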

For these reasons, compiler support for parallelism is often provided for the Fortran programming language. In this case, support will often be given in the form of compiler directives embedded in comments; for example, one such directive tells the compiler how to run the next DO loop in parallel. By using these directives, data items must be identified as shared or private (sometimes known as local), any variables which are involved in reductions must be named, and so on. One of the advantages of using directives embedded in comments is that directives for more than one machine's compiler may be added to the same source code: each compiler will treat all other platforms' directives as comments and ignore them. To demonstrate this, an example code which has had parallelising directives added to it is shown in Figure 3.4.
c Variable declarations
      integer i
      double precision a,b,c,cmax
      dimension a(16),b(16),c(16)
c
      cmax=0.0d0
c Main loop
C$DOACROSS LOCAL(i), SHARE(a,b,c), REDUCTION(cmax)
c*ksr* tile (i,private=(i),reduction=(cmax))
      do 100 i=1,16
        c(i)=a(i)*b(i)
        if(dabs(c(i)).gt.dabs(cmax)) cmax=c(i)
  100 continue
c*ksr* end tile

Figure 3.4: Simple Fortran Code with Parallel Directives for the SGI Challenge and KSR1 compilers

The code multiplies two arrays, a and b together, placing the results in c, and setting cmax to the element of c which has the largest magnitude. Here, the `$DOACROSS' comment enables the code to run in parallel on the SGI Challenge machine, and the `*ksr* tile' and `*ksr* end tile' comments enable the code to run in parallel on the Kendall Square Research KSR1.


Unfortunately, there is no standard syntax for these directives, even though the parallelising directives for other machines will contain roughly the same information. There are subtle differences between the directives on different platforms: on the KSR1, variables are assumed to be shared if not declared as private, whereas on the Challenge they must be declared as such; on the KSR1 a directive is required to indicate the end of the loop, but on the Challenge no end-of-loop directive is needed. More pronounced differences are typically found when comparing the different schemes provided for distributing the loop iterations to the threads of the program, but again, these will mostly be differences in terminology. Apart from these nuances, the information given is identical. PCF-Fortran was an attempt to design a standard for the way in which Fortran was parallelised for shared memory, but the discussions broke down while discussing low-level implementation details, with the only output from the group being the draft document [PCFForum90].

One attempt which has been made to provide a standard way of parallelising loops in Fortran is the `forall' command in High Performance Fortran [HPFForum93]. Unfortunately this command is limited in its scope, supplying a tightly synchronised kind of parallelism that is not adequate for many applications. Apart from this, no explicit parallelism is present in Fortran, but new commands have been added to Fortran 90 [ANSI92] for which a compiler could easily generate parallel code, such as vector-style operations on arrays.

Cray Research Incorporated produce a Fortran compiler capable of generating parallel code for their Massively Parallel Processors (MPPs), including the Cray T3D machine [CrayMPP]. The parallel extensions to the language are referred to as the Cray Fortran programming model (CRAFT).

2. The Fortran-S language [Bodin93] provides a similar set of compiler directives for shared memory machines which was designed to be portable, but neither this nor any alternative has yet been adopted as a standard.
3. It is important to note that HPF is not a standard in the sense that its specification [HPFForum93] is not published by a recognised standards body such as ANSI or ISO.


works by using compiler directives embedded in comments to parallelise loops, CRAFT is much more restrictive than the systems on the SGI Challenge and the KSR1. With CRAFT, a loop which is to be parallelised must be specified in terms of its arrays, which themselves must be explicitly divided up, or partitioned, onto the processors. Also, it is only possible to loop over an array index which is an exact power of 2. This means that a program with directives for other machines cannot, in general, simply have more directives added to it; the code will require an amount of reworking as well. These restrictions are a result of the fact that, in general, the Cray MPPs are not shared memory machines, and so all support for data distribution and communications has to be generated by the compiler. Because of the nature of CRAFT, the Cray T3D will not be used as a target platform for the shared memory version of the code.

The target architectures for the shared memory implementations will be a Silicon Graphics Incorporated Challenge Series machine and a Kendall Square Research KSR1. Full details of these architectures are given in Sections 3.5.1 and 3.5.2, respectively, along with further information on their Fortran compilers.

3.3 Distributed Message Passing Model


The Distributed Message Passing Model has been popular for many years now. Its key advantage is that it can map onto almost any machine with more than one processor, or indeed even a network of uniprocessors, such as a cluster of networked workstations. This property has enabled the design of message passing libraries, which provide all of the required functionality to perform message passing, and can be ported to a vast array of different hardware. This enables programs to be written in an already portable programming language, such as Fortran 77 [ANSI78], augmented only by calls to the message passing library. Such programs can then be ported to any machine for which there is a version of


the library, e.g. from a network of workstations to a Cray T3D, with little or no effort.

The main disadvantage of message passing is the lack of a contiguous memory map: the distributed memory model is thus not as intuitive as the shared memory model. Here the user must know exactly what parts of what arrays are held on which processors, and handle any interprocessor communication explicitly. However, this does result in the programmer considering data distribution and communication patterns more carefully than with shared memory, where all the communications take place automatically through the shared memory system.

With message passing there is no real question as to the best way to program: a library is used which abstracts away from the communications layer, be it a processor interconnect in a multicomputer, buffers in a shared memory machine, or even TCP/IP between workstations. The two most popular message passing libraries are the Parallel Virtual Machine (PVM) [Geist93] and the more recent Message Passing Interface (MPI) [MPIForum94]. Because it is more recent and increasing in popularity, MPI will be used to implement the message passing codes.

The target architectures for the message passing codes will be a Silicon Graphics Incorporated Challenge machine and a Cray Research Incorporated T3D. These machines are discussed in detail in Sections 3.5.1 and 3.5.3, respectively, where details of the MPI implementations that will be used on each platform are provided. This section concludes with a brief look at MPI and its current functionality.

3.3.1 MPI - the Message Passing Interface


The standard for MPI, as defined in [MPIForum94], is large. The first version of this document was the preliminary standard (known as MPI1), proposed in draft form in 1992, and revised and re-issued in early 1993. The first version of


the MPI standard proper appeared in November 1993 (Version 1.0) and, since then, the standard has been through only one minor revision, resulting in Version 1.1, which was issued in June 1995. Since April 1995, the MPI Forum have been discussing MPI2, a proposed set of extensions to MPI. A draft proposal for MPI2 is currently available over the Internet, and a final version is expected to be published in March 1997.

MPI makes good use of data types, with a message consisting of a number of data items of a particular type. This means that messages deal with data in a convenient form, rather than leaving the user to specify an area of memory in bytes. Handling messages in these terms is all the more important in a language, such as Fortran, which performs no type checking around function calls and so provides no `sizeof' function (as there is in C). As one message consists of data of only one type, operations have been provided to allow the creation of user-defined data types, which are constructed from the base types provided. These new user-defined types can be used in future constructions and, together with the provision for both record structures and overlays, this provides the same functionality as `struct' and `union' in C.

The basic send and receive calls are provided in both synchronous and asynchronous forms. These are the simplest calls available: a send is executed on one processor, a receive on another, and the message moves between them. As well as these simple calls, more complex calls have been added to perform reductions, broadcasts, and so on. An example of this is `mpi_reduce', which is executed on all processors and takes as one of its arguments a `root' process number. In this case the `root' process executes receives to get the data from the other processors and then performs the reduction, while the other processes execute sends.

Because MPI was subject to detailed discussion throughout its design, and evolved over two years before its last revision in 1995, it contains many functions


which implement common communication patterns, thus enabling code to be written in a simpler form than it could be if just coded in terms of sends and receives.
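To make this concrete, the fragment below sketches what the simplest of these calls look like from Fortran 77. It is an illustrative sketch only: the program and variable names and the message tag are invented for the example, although the calls themselves (mpi_init, mpi_comm_rank, mpi_send, mpi_recv, mpi_reduce, mpi_finalize) are standard MPI.

      program mpisketch
c     Illustrative sketch: process 1 sends one double precision value
c     to process 0, then all processes combine their values into a
c     global sum which arrives on the root process (process 0).
      include 'mpif.h'
      integer ierr, rank, nprocs, status(MPI_STATUS_SIZE)
      double precision x, total
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      x = dble(rank)
c     Simple point-to-point message: a send on one process is matched
c     by a receive on another.
      if (nprocs.ge.2) then
         if (rank.eq.1) then
            call MPI_SEND(x, 1, MPI_DOUBLE_PRECISION, 0, 99,
     &                    MPI_COMM_WORLD, ierr)
         else if (rank.eq.0) then
            call MPI_RECV(x, 1, MPI_DOUBLE_PRECISION, 1, 99,
     &                    MPI_COMM_WORLD, status, ierr)
         end if
      end if
c     Higher-level collective call: the reduction described above.
      call MPI_REDUCE(x, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM,
     &                0, MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      end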

3.4 Bulk Synchronous Parallel


The key idea behind BSP is to overlap computation and communication. In this model, memory is distributed, but is considered to be a shared resource, accessible via put and get operations. The model was first described in [Valiant90], where it is stated that the programmer should write programs with sufficient `parallel slackness'. This means that programs should be written for a number of processes (referred to as virtual processors, v) which is considerably larger than the number of processors (referred to as physical processors, p), with a minimum for v of v = p log p being suggested. It is further explained that the compiler can then exploit this slackness by clever scheduling which will overlap communications and computations efficiently. It is mentioned that automatic memory and communication schemes are not mandatory: the model does allow the user to retain control of these elements if they so desire.

The BSP model of parallel computation - or the BSP computer - as defined in [Valiant90], consists of three parts:

1. several components which perform processing and/or memory functions;

2. a router which delivers messages point-to-point between pairs of components;

3. facilities for synchronising all or a subset of the components.

Figure 3.5 shows an abstract and simple form of the BSP computer, showing both the elements and the router. Note that the mechanism for barrier synchronisation is not shown, in order to simplify the diagram.


[Figure 3.5 shows processing elements (P), memory units (M) and combined processing/memory elements (P/M) connected by a router.]

Figure 3.5: Bulk Synchronous Parallel Computer (Barrier Synchronisation Mechanism Omitted for Clarity)

The model is sufficiently abstract that it may be implemented in many ways. The router could be an autonomous unit, travelling between processor/memory elements delivering messages, or it could simply be a fixed interconnection network. Also, any distribution of memory is acceptable, from completely centralised to completely distributed, including all possible hybrids in between. This means that the BSP computer can be realised easily on any existing multicomputer or multiprocessor machine. Hence, already, one of the two criteria for a good bridging model has been satisfied. Whether or not an expressive software interface can be provided by BSP is yet to be shown.

To program the model, a computation is divided into supersteps. A superstep consists of both local computation and interprocess communication. Communication differs from traditional message-passing in two key ways:

1. all communication is `one-sided', i.e. a put does not require a corresponding get on another processor, and

2. communication can happen at any time from when it is initiated until the beginning of the next superstep.


This uncertainty as to precisely when data is communicated gives the model one of its most interesting features, and also gives a lot of scope for overlapping communications with computation. The disadvantage of this type of communication is that care must be taken not to overwrite another processor's workspace by issuing a put to that area which could conceivably be executed immediately.

Since the model was proposed in 1990, research into BSP has been continued by several groups working independently. Two groups have produced portable BSP software libraries for writing software on parallel machines, namely Oxford BSP [Miller93] and Green BSP [Goudreau95]. Meanwhile, the Harvard team have concentrated on producing more theoretical work, such as [Gerbessiotis92], and have recently proposed a complete framework which might be used to develop architecture-independent parallel software, based upon the BSP model [Cheatham94].

In late 1995, it was proposed (by the Parallel group at the Computer Laboratory in Oxford University) that there should be an attempt to unify the ongoing research into BSP by producing a worldwide standard for a BSP library. A preliminary draft proposal was put together, and an E-Mail discussion initiated. As a starting point for the proposal, the functionality of the original Oxford BSP library was augmented with some of the functionality from the Green BSP library. The final draft proposal [Goudreau96] was issued in April 1996, with the revised list of authors containing names from the Oxford BSP and Green BSP teams, and the team from Harvard.

Interestingly, the group has chosen to discard two of Valiant's key original ideas, namely parallel slackness (which was discarded in favour of having one process per processor) and subset synchronisation. Areas where subset synchronisation was required were identified by several members of the E-Mail discussion group; typically these were symbolic applications which


used functional decomposition to obtain parallelism while parsing large graph structures. Despite these protestations, subset synchronisation was not reintroduced, on the dual grounds of keeping the interface to the library simple, and of keeping a simple performance prediction model. Indeed, performance prediction was seized upon, soon after the first libraries were constructed, as being of key importance, and has remained a major driving force behind the new standard for BSP, e.g. [Hill96].

The other key change is that the original `unpredictable' put and get primitives have been renamed as `high performance' put and get. The basic put and get calls now buffer all communication requests, which then take place at the end of the superstep. This means that the BSP community is moving away from the unpredictability which gave the model one of its key differences from traditional message-passing.

This new standard BSP is known as World Wide BSP (or WWBSP). The initial standard for WWBSP has been implemented as a library, which is currently available (as a Beta release) and can be located via the World Wide Web page for WWBSP [WWBSPHome]. Version 0.6 of this library will be used to implement the BSP versions of the micromagnetics code. The library is available for many platforms, including the Silicon Graphics Incorporated Challenge Series machine and the Cray Research Incorporated T3D, which will be the target architectures for the BSP implementations. Details of these architectures are given in Sections 3.5.1 and 3.5.3, respectively, along with further details of how the WWBSP library is implemented on each.

3.5 Target Architectures and their Compilers


This section provides technical information about the various target architectures, such as clock speeds, memory sizes, processors and their organisation. Support for


parallelism given by the Fortran compilers for each architecture is also discussed, as are the software libraries used in the implementations.

3.5.1 SGI Challenge Series


The first target architecture is the Silicon Graphics Incorporated (SGI) Challenge Series machine. The machine has four 100MHz MIPS R4000 processors, connected together using a simple bus, and running the IRIX operating system, version 5.3. It is a true shared memory machine, with 512Mb of main memory. Each processor has 1Mb of level-two cache (direct mapped, 128 byte lines) and 16kb of level-one cache (direct mapped, 16 byte lines). The caches are all write-through, so updates to the level-one cache are propagated to the level-two cache and main memory immediately.

Each processor may only address its own level-one cache, so if an address not contained within the level-one cache is accessed, then a new cache line containing the referenced address must be loaded into the level-one cache from the level-two cache. Similarly, if the address is not contained in the level-two cache, the data must be loaded from main memory. This scheme means that copies of the same memory location may appear in different processors' caches. To maintain consistency, an invalidate policy is used, so that when one processor writes to a memory location, all other copies of that location (in the other processors' caches) are automatically marked as being invalid; further reads of this location will cause the level-one and level-two caches to be updated from main memory. Sequential consistency is maintained, which means that all writes from any processor to any memory location are immediately visible to all processors. This is the most rigorous form of consistency to maintain, and is exactly what is provided in a uniprocessor. For more details of consistency models, see [Adve95].



No. of threads    Overhead (microseconds)
       1                      2
       2                     18
       3                     32
       4                     50


Table 3.1: Overheads Associated with Barrier Synchronisation on the SGI

Lock and Barrier Costs


Acquiring a lock using the `mp_setlock' command costs 80 processor cycles (0.8 microseconds) if the calling processor was the most recent owner, and 400 cycles (4 microseconds) if another processor was. Releasing a lock costs around 60 processor cycles (0.6 microseconds). These timings were obtained experimentally. The overheads for barrier synchronisations (using the `mp_barrier' call) were also obtained experimentally, and are shown in Table 3.1.
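As an indication of where these costs arise in practice, the fragment below sketches the usual pattern: a lock serialises an update to a shared variable, and a barrier ensures every thread sees the completed result before reading it. The variable names are invented for the illustration, the release call is assumed here to be spelled `mp_unsetlock', and a REDUCTION clause on a parallelised loop would normally be preferred to an explicit lock for this particular pattern; the sketch simply shows the calls whose costs are quoted above.

c     Illustrative sketch: each thread adds its own contribution
c     (mypart) to a shared running total, then waits until every
c     thread has done so before reading the result.
      call mp_setlock()
      total = total + mypart
      call mp_unsetlock()
      call mp_barrier()
c     total is now complete and may safely be read by any thread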

Fortran Compiler Parallel Support


The SGI Challenge Fortran 77 compiler provides support for parallelism via the `$DOACROSS' directive, which is embedded in a comment in the code, placed just before the start of a loop. One directive is capable of making a `DO' loop run in parallel, and supports scalar reduction variables, as well as providing a variety of scheduling schemes to govern how the iterations of the loop are divided among the processors. The directive cannot be nested, and further directives encountered while inside a parallel loop are ignored. There is a utility, PFA (Power Fortran Analyser), which analyses loops and generates parallelising directives for loops in which it can find no dependencies. Some amount of loop unrolling [Dongarra79] is also performed.

4 Reductions are operations which decrease the dimensionality of their input, most commonly computing a scalar result from a vector of values. The operator used in reduction operations will typically be both commutative and associative, allowing the computation of local results on each thread. Typical reduction operators are sum, product, minimum and maximum.


No. of threads    Overhead (microseconds)
       1                      4
       2                     12
       3                     15
       4                     17


Table 3.2: Overhead Associated with Using the `$DOACROSS' Directive on the SGI Challenge

Unfortunately, the utility cannot always establish a lack of dependencies in long loops, and so here the directives will need to be derived manually in order to achieve maximum parallelism.

The thread which runs the serial sections of the code (as well as some of the loop iterations) is referred to as the master thread, and the remaining threads, which only run iterations of parallelised loops, are known as the slave threads. The overheads associated with using a `$DOACROSS' are shown in Table 3.2. Comparing these with the barrier costs shown in Table 3.1, it can be seen that a more efficient barrier algorithm than the one used by `mp_barrier' is being used to synchronise the threads at the end of a parallelised loop. Again, the figures were arrived at experimentally.

Message Passing Interface Library


SGI provides a proprietary MPI library, but it is only available for IRIX version 6, which was not available on the departmental SGI Challenge. Because of this, only free shareware versions of MPI were available for use. Of these, the only version which claimed to provide support for shared memory machines directly, i.e. using shared areas of memory rather than a generally applicable sockets-based implementation, was MPICH. MPICH (currently at version 1.0.13) is a full implementation of the MPI standard [MPIForum94], which is available via the World Wide Web from [MPICHHome]. When installing MPICH, the default


setting for the SGI machine is to use a network device capable of sending messages not only to all processors inside the machine, but also to processors in other machines which have been clustered together. This setting was changed so that shared memory was used directly, as the only configuration being tested was that of a single machine.

5 This was achieved by supplying `-device=ch_lfshmem' instead of `-device=ch_p4' to the `configure' command which is used before building the library.

World Wide Bulk Synchronous Parallel Library


The World Wide BSP library (Beta version 0.6) is implemented for shared memory machines using the Unix System V IPC calls. These calls provide semaphores, shared memory segments and message queues which can be used by any process on the machine, provided they have the ID number of the object and the correct permissions (which closely resemble file permissions). Clearly, this is not the most efficient system to use on a shared memory machine, as it is possible to set up areas of memory which are visible to all threads in that particular process, without the expense of making these available system-wide. This may change in future implementations of the library.

3.5.2 Kendall Square Research KSR1


The second target architecture is a Kendall Square Research Incorporated KSR1. This machine contains 31 proprietary processors which are clocked at 20MHz, running version R1.2.2 of the KSR version of the OSF/1 operating system. The KSR1's architecture is a Cache Only Memory Architecture (COMA), which means that all memory is organised as cache. The KSR1 ALLCACHE system means that memory locations have no specific `home'. Each processor has 256kb of level-one cache, which is organised as 2-way set-associative on 2kb blocks, with 64 byte lines. There is also 32Mb of level-two cache (the main


memory), which is organised as 16-way set-associative on 16kb pages, with 128 byte lines. The level-one cache is write-through, with changes being propagated immediately to the level-two cache. Cache coherency is implemented in hardware only to the subpage level. This means that the operating system must maintain an up-to-date directory which states where different copies of pages are located. This is required so that events such as writes by different processors to the same page can be resolved correctly. The processors are configured in a single slotted ring, which has a bandwidth of 1Gb/s. This means that, unless every processor in the machine is engaged in continuous communication, the network can never be saturated.

Lock and Barrier Costs


Acquiring a lock using the `mp_setlock' command costs around 25 processor cycles (1.25 microseconds) if the calling processor was the most recent owner, and around 200 cycles (10 microseconds) if another processor was. Releasing a lock costs around 25 processor cycles (1.25 microseconds). These times were obtained experimentally. The overhead for a barrier synchronisation on the KSR1 ranges from 12000 processor cycles (0.6ms) to 20000 cycles (1ms), with a logarithmic dependency on the number of processors [Grunwald93].

Fortran Compiler Parallel Support


The KSR1 Fortran 77 [ANSI78] compiler provides support for parallelism via the `*ksr* tile' and `*ksr* end tile' directives, which are embedded in comments placed in the code around a `DO' loop. This directive is capable of making `DO' loops run in parallel, and supports scalar reduction variables. The directive also provides a variety of scheduling schemes to govern how the iterations of the loops are divided among the processors, including dynamic scheduling utilising a `grab' strategy, where the iterations are placed in a work pool from which the threads take groups of iterations until the work is done.



No. of threads    Overhead (ms)
       1              0.80
       2              0.94
       4              1.08
       8              1.22
      16              1.36
      32              1.50


Table 3.3: Overhead Associated with Using the `tile' Directive on the KSR1

The directive can be nested, which is useful in loop nests where the loops are over fewer elements than there are processors, as it may then still be feasible to utilise all the available processors. It is possible to get the compiler to attempt to place these directives in the code automatically, using the KSR KAP utility. However, to ensure maximum parallelism, the directives need to be placed manually.

The overhead associated with using a `tile' statement has a logarithmic dependency on the number of processors. The approximate overheads for when the number of processors is a power of two are shown in Table 3.3. These figures were obtained experimentally. As with the SGI Challenge, the thread which runs the serial sections of the code (as well as some of the loop iterations) is referred to as the master thread, and the remaining threads, which only run iterations of parallelised loops, are known as the slave threads.

3.5.3 Cray Research Incorporated T3D


The Cray T3D is the third target platform. The specific machine used consists of 512 150MHz DEC Alpha processors, arranged in a 3D torus. The processors on the T3D run the UNICOS MAX operating system, version 1.3. To provide an interface to the T3D, and to handle the running of batch jobs and basic I/O,


there is a front-end Cray Y-MP running UNICOS version 8.0.4. Each processor on the T3D has 64Mb of local memory, and 8kb of on-chip cache which is direct mapped with a cache line size of 32 bytes. The cache uses a write-back policy, which means that writes to the cache are not immediately propagated to local memory.

The operating system on the T3D itself is minimal, being responsible only for memory allocation and some I/O tasks. There is no hardware support for virtual shared memory, and support for virtual memory is minimal: while physical addresses are not used by the software, no swapping to disk is supported, so the program and its data must fit within the 64Mb of main memory.

Although the T3D does not support shared memory operation directly, there are calls which support accesses to other processors' memory. This system of low-level explicit data communication is known as the SHMEM programming method. The method is utilised by calls to highly optimised library routines which allow distributed data items to be used. The easiest cases of this are for static data, which will always be at the same location on all processors of the T3D.

6 This is because the T3D is programmed in a Single Program Multiple Data-stream (SPMD) manner. This means that the same executable image is loaded onto each processor, and all copies then begin execution simultaneously.
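As a flavour of the SHMEM style just described, the fragment below sketches a remote write of a static array into the corresponding location on another processor. It is an illustrative sketch only: the array and variable names are invented, and the call is assumed to follow the shmem_put convention of transferring a given number of 64-bit words to a named processor; the exact routine names and argument lists should be taken from the Cray documentation.

c     Illustrative sketch: copy the first n elements of the static
c     array a into the same array on processor ipe.  Because a is
c     static, it lies at the same address on every processor, so the
c     local array can be used to name the remote target directly.
      integer n, ipe
      parameter (n=1024)
      double precision a(n)
      common /work/ a
c     ... fill a locally and choose the destination processor ipe ...
      call shmem_put(a, a, n, ipe)
c     synchronise before any processor reads the transferred data
      call barrier()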

Barrier Costs
An interesting feature of the T3D architecture is that it provides a hardware mechanism for performing barrier synchronisation. This means that the overhead for barrier synchronisations on the T3D is small compared with the overheads of software barrier implementations. The overheads for barrier synchronisations (using the `barrier' call) were also obtained experimentally, and are shown in Table 3.4. These times compare particularly well with those for the SGI Challenge, especially as there appears to be little dependency upon the number of processors.



No. of threads    Overhead (microseconds)
       1               0.75
       2               1.84
       4               1.84
       8               1.84
      16               1.84
      32               1.86
      64               1.88


Table 3.4: Overheads Associated with Barrier Synchronisation on the Cray T3D

Message Passing Interface Library


Cray Research Incorporated, together with the Edinburgh Parallel Computing Centre, have developed their own version of MPI, called CRI/EPCC MPI. This fully implements the current MPI standard [MPIForum94], and also contains early versions of the one-sided messaging which is proposed in the current draft of the MPI2 standard.

World Wide Bulk Synchronous Parallel Library


The World Wide BSP library was originally implemented on the Cray T3D. BSP should be extremely efficient on the T3D, as its `bspput' and `bspget' calls map exactly onto the SHMEM calls. The other requirement of the BSP library is a mechanism for synchronisation which, as was mentioned earlier, is performed in hardware on the T3D. Version 0.6 (Beta) of the library has been used to obtain the results given later in this thesis.


3.6 Summary
The von Neumann model was introduced, along with two simple parallel extensions to it: the multicomputer and the multiprocessor. Although different in their approach to memory distribution, these two models were shown to be capable of emulating each other. The two most popular methods for programming in parallel on both of these architectures are shared memory and message-passing, which still dominate research into High Performance Computing as a whole. This popularity led to shared memory and message-passing being chosen as two of the three methods to be evaluated.

The multicomputer and multiprocessor models are examples of architecture design being driven by technological advancements. The concept of bridging models, which is an alternative way of driving the design of parallel architectures, was introduced. Several specific bridging models were described, including models with implicit parallelism, such as functional programming, and models which rely upon explicit parallelism, such as Bulk Synchronous Parallel (BSP). The recent popularity of BSP led to it being chosen as the third method to be evaluated.

The Fortran language was chosen as the target for development of the parallel codes, as it was shown to allow the largest choice of implementation techniques. For each method, implementation techniques were chosen: the shared memory code will be implemented using loop-parallelising compiler directives; the message-passing code will be implemented using the MPI library; and the BSP code will be programmed using the World Wide BSP (WWBSP) library. Target platforms were selected for each method: the shared memory code will be implemented on an SGI Challenge and a KSR1, and the message-passing and BSP codes will be implemented on an SGI Challenge and a Cray T3D. The chosen implementation techniques were discussed with respect to the target platforms for each method. Finally, technical details of the target platforms were supplied,


concentrating on information relevant to the chosen parallelising methods. With the evaluation criteria set, and the methods to be evaluated chosen, the implementation of the parallel codes can now proceed. However, before the implementation details are discussed, a novel, incremental development method for developing distributed memory model codes will be presented. This method will enable the two distributed memory model codes, the message-passing and BSP codes, to be developed in a systematic manner.

Chapter 4

Developing Distributed Memory Model Code


Two of the parallelising methods chosen, namely message-passing and Bulk Synchronous Parallel (BSP), use a distributed memory model. Usually, when writing such a code, the rewrite is performed as a single, monolithic step, during which intermediate versions of the code cannot be executed. This approach is likely to introduce errors into the code, which is undesirable, as debugging parallel programs is a non-trivial procedure. This chapter presents a novel development method which allows the gradual evolution of a fully distributed code from a serial code.

Firstly, the problems that a distributed memory model presents are examined, and comparisons with shared memory systems are made. In Section 4.2, an overview of the method is given, and the method is then explained fully in Sections 4.3 and 4.4. This method will then be used to develop the message-passing and BSP codes, with the details of these implementations being discussed in the next chapter.


4.1 Issues with Distributed Memory Codes


With a program of the type studied later in this thesis, and introduced in Section 5.1, the best way to obtain a parallel code is to identify loops whose iterations are independent of each other, and to run iterations of these loops on different processors. This approach yields codes whose threads all follow the same control path through the code. When writing such parallel codes, a good distribution of work is essential if an efficient code is to be produced, and when writing them using a distributed memory model, the distribution of data structures among the threads of a program is of paramount importance.

Consider the situation with an array of 16 elements, and a program running on a four-processor machine. If most uses of this array are in loops which involve all the elements being treated independently from one another, then the best way to distribute the array is to store four elements on each processor. This way, each processor can operate on its subset of the array simultaneously. Provided that each element takes approximately the same time to process, the load on each processor should be close to a quarter of the original load. If this is the case, all four processors will be busy while the loops are executed, yielding good efficiency.

Unfortunately, there will be many cases where particular arrays may be operated on locally for part of the time, but will require operations, such as reductions, to be performed on the whole array. These are not as easy in the distributed model, because now inter-process communications must be considered. For example, the global sum of the array may be required on all processors, in order that a branch can be taken. Clearly, the best way of achieving this is for each processor to use inexpensive MC-LOCAL accesses to calculate a sum of all the array elements held locally, and then for the more expensive MC-REMOTE accesses to be used to get the other processors' local sums. After receiving the other three local sums, each processor can then calculate the global sum, and branch



appropriately, i.e. the same way as all the other processors. It is worth noting here that MPI provides calls to deal with common operations such as reductions, as mentioned in Section 3.3.1.

But there are more difficult cases than these. Consider an array, over which there are many loops, with each iteration of the loop processing a different element of the array. There may be situations where it takes a different amount of work to process each array element, so that simply dividing the iterations into p evenly sized chunks, and then distributing these chunks among the p processors, may not be appropriate. This is not a problem if one distribution of the iterations is adequate for the entire program; where difficulties will be encountered is when the distribution must change during the execution of the program. In the case of the shared memory model, directives are simply written to distribute the loop iterations in different ways, with the shared memory system handling any required communications. But, in the case of a distributed memory model, explicit communications must be added to redistribute the array. In addition to the task of coding the transfer of data, there is also the problem of a new local numbering scheme: if, for example, an array on a four-processor machine is reorganised, sequential addresses on the first processor might contain the first, fifth, ninth, etc. elements by the old numbering scheme. This is in sharp contrast to the shared memory model, where the same numbering scheme is maintained.

This is not to say that the shared memory model will necessarily provide more efficient implementations. While explicit array redistribution is not necessary, at least the same amount of communication must be performed by the time the whole array is accessed using the new distribution of loop iterations - the data

1 While the same issues exist for shared memory implementations, in the case of the Fortran directives, schemes exist for performing different distributions of loop iterations. The KSR Fortran compiler directive even provides a scheme for allocating iterations dynamically, as mentioned in Section 3.5.2.


must somehow reach the processors reading it. Indeed, the most compelling argument for distributed memory models is that the programmer is forced to think much harder about the best data distributions, and about the associated communication patterns. However, any good shared memory programmer will also consider data distribution, as good spatial locality is important both for maximising cache use and for minimising false sharing.
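For the global-sum example above, the corresponding code might look like the following sketch in Fortran 77 with MPI; the variable names are invented, and the fragment assumes that MPI has already been initialised. Each process sums its own local part of the array, and a single collective call combines the partial sums and returns the total to every process, so that all of them take the same branch.

c     Illustrative sketch: a(1..nlocal) holds this process's part of
c     the distributed array, asum its local partial sum, and gsum the
c     global sum returned to every process.
      include 'mpif.h'
      integer i, nlocal, ierr
      parameter (nlocal=4)
      double precision a(nlocal), asum, gsum
      asum = 0.0d0
      do 10 i = 1, nlocal
         asum = asum + a(i)
   10 continue
      call MPI_ALLREDUCE(asum, gsum, 1, MPI_DOUBLE_PRECISION,
     &                   MPI_SUM, MPI_COMM_WORLD, ierr)
c     every process now holds the same gsum and branches identically
      if (gsum.gt.0.0d0) then
c        ... take the branch ...
      endif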

4.2 Overview of Method


As stated earlier, producing a distributed code using either BSP or MPI could involve substantial programming without being able to run the code. Without a way of smoothing the passage between the single address space model and the distributed memory model, it is likely that many coding errors will be introduced. It would be desirable to move to the distributed memory version slowly, with the move to MPI or BSP as a final step. Obviously, there is bound to be a small leap as the code is changed to use BSP or MPI, but the smaller that step is, the fewer errors will be introduced and the easier the code will be to debug. What is required is a series of incremental steps which allow the code to evolve towards the distributed memory model, rather than the revolutionary step of doing all the coding at once.

The diagram in Figure 4.1 shows just such a development path for use on a shared memory machine. As can be seen from the diagram, each step is evolutionary, except for the final small leap to BSP, and from BSP to MPI. It is possible to move from the distributed memory version to MPI directly (as is reflected in the diagram), but moving to the BSP version first allows a truly distributed code to be written before the communications are made two-sided. It should be noted that the method described is only appropriate for the same loop

2 The draft standard for MPI2 does contain one-sided messaging, which will bring MPI as close to the distributed memory version as BSP is now.


parallelism that the Fortran compiler directives are used for.

The model allows a shared memory version of the code to be developed using the most primitive tools for programming a shared memory machine, as introduced in Section 3.2. This is important, as absolute control over what each thread is doing is retained. This version of the code also allows the way that data is stored to be changed slowly, as gradually as one array at a time: splitting the arrays, then altering the numbering system, and finally making all reads local (data must be explicitly `sent' from its owning thread), thereby providing a distributed memory version of the code.

To get a BSP implementation from this distributed memory code, the explicit writes to other processors become bsphpput commands, and the barrier synchronisations become bspsync calls. Areas which are written to by these writes must also be `registered' with the BSP library. Together with stripping the last index (the index which represents the thread number) away from all array references and replicated scalars, this should be all that is required. To move from BSP to MPI, it is simply a case of making communication two-sided, replacing BSP calls with MPI calls, and perhaps replacing some of the manual reductions with higher-level MPI commands. Communications to other processors may also be manually combined, to reduce the number of messages sent.

The move from serial code to threads and barriers is now explained in more detail, followed by a more detailed explanation of the changes in array distributions. When describing the method, the phrase `parallel region' will apply to any part of the code which is executed by all threads; all other parts will be `serial regions'.
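As a sketch of the shape of that final step, the fragment below shows a single remote write before and after the change, using the call names just mentioned. The argument lists shown are only indicative: the precise interface of the WWBSP library (registration calls, offsets and byte counts) should be taken from its documentation, and the array and variable names are invented for the illustration.

c     Threads-and-barriers version: thread myid writes one element
c     directly into a neighbouring thread's copy of the array, then
c     all threads synchronise.
c         a(j,othertid+1) = value
c         call BARRIER ( )
c
c     BSP version (indicative argument lists only): the remote copy of
c     a must previously have been registered with the library; the
c     write becomes a one-sided put of one 8-byte value at offset
c     (j-1)*8, and the barrier becomes a superstep boundary.
      call bsphpput(othertid, value, a, (j-1)*8, 8)
      call bspsync()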


      Serial Version of Code
               |
      Threads and Barriers Parallel Implementation
               |
      Multiple Copies of Arrays (One Per Thread)
               |
      Local Numbering System For Some Arrays
               |
      All Reads From Arrays From Own Copy Only
               |
      All Other Reads From Local Data Only
               |
      BSP Implementation
               |
      MPI Implementation

(Each transition above is an evolutionary step, except for the final steps to the BSP and MPI implementations, which are revolutionary steps; the MPI implementation may also be reached directly from the last distributed memory version.)

Figure 4.1: Proposed Development Path for Transition to Distributed Memory Codes


4.3 Serial Code to Threads and Barriers


The creation of a threads and barriers implementation is not trivial, even when previous study has revealed the best data distribution schemes, as well as which loops should be executed in parallel. Because of the potential for timing problems, it is unreasonable to expect to be able to go through the entire program making each processor work on its share of the loop iterations, placing barriers where necessary, and then expect the code to run correctly. The placing of barriers is a hard task - this is best illustrated by the attempts of the parallelising compiler community to perform the task automatically [O'Boyle95] - so, again, the code must evolve incrementally, rather than undergo revolution.

To start with, a gradual method is visualised, in which the code starts to execute in parallel, but where, part-way through the code, all but one of the threads terminate, leaving the rest of the code to execute serially. This would allow the desired, gradual parallelisation of the code. But, to parallelise the code, lines of code must be added to source files, and the execution path of the code will not simply follow the textual layout of the code (except in the trivial case where the program is loopless). The method described here allows each routine in turn to be dealt with almost exactly as a piece of text, irrespective of control flow. Although a method could be devised which would only work for simple loops, this thesis is concerned with legacy codes, which may have been written when goto statements were perfectly acceptable (i.e. before [Dijkstra68]). It might be argued that one of the existing tools should be used to remove gotos automatically, but these do not guarantee success, so the approach presented must be as general as possible.

The hardest situation for maintaining a line-by-line approach is when the parallel region must be extended into an area of the routine where the control flow is not simply a single pass, top to bottom. This includes sequential loops



(e.g. the steps of the hysteresis curve in the micromagnetics code, outlined in Section 5.1) and codes where goto statements are used, giving complex control flow patterns. Situations like these present problems because some parts of the code will be executed firstly by all the threads, but later, when the slave threads have returned, by just one thread. It must be ensured that all iterations of parallel loops are still executed, and that the master thread does not call barrier synchronisations when it is the only thread left. How the method deals with this situation will become clear later.

The method for achieving this implementation consists of two parts:

1. a recursive method which allows subroutines and functions to be parallelised separately, and

2. a method for parallelising a subroutine or function body.

The two parts will now be explained in detail.

4.3.1 Parallelising Routines and Functions Separately


This part of the method starts with a trivial alteration to the main routine of the program, as shown in Figure 4.2. The goto command is required in order to provide a way for the slave threads to leave the routine irrespective of whether they are inside a loop nest or a conditional statement. The language used in the examples is Fortran, although this method is applicable to any language which can call the appropriate system commands for starting multiple threads, performing barrier synchronisation, and locking data. In this case, the program started out as just the <program header>, containing common definitions, etc., and the <program body>. The system routines used are START_THREADS, to initialise the multiple threads; END_THREADS, to kill them; BARRIER, to perform a barrier synchronisation; and GET_THREAD_ID, which returns the calling thread's numerical identity - the master thread is assumed to have identity zero.



      program example
      parameter(noofthreads=4)
      <program header>
      integer myid,nthreads,GET_THREAD_ID
      nthreads=noofthreads
      call START_THREADS ( nthreads )
      myid=GET_THREAD_ID ( )
      if (myid.ne.0) goto 9999
      nthreads=1
      <program body>
 9999 call BARRIER ( )
      call END_THREADS ( )
      end

Figure 4.2: Initial Alteration to Serial Code

The example is trivial. Multiple threads are started, but only one of them (thread zero) executes the program body; the others wait at the barrier routine, after which they are killed. Now, the approach is for the genuinely parallel part of the code to encroach upon the program body, statement by statement, until it reaches the bottom. As stated earlier, this part of the method will be dealt with separately.

As the parallel region of the code progresses through the main routine's body, subroutines and functions will be encountered. The first time a given subroutine (or function) is encountered, it must be examined closely to decide what to do. If the subroutine (or function) contains no opportunities for parallelism, i.e. the source for the called routine is not available, or it contains no loops over sufficiently large arrays of data and no calls to routines which do, then no alteration of it is made. Similarly, if the subroutine or function is called from within a loop whose iterations have been divided amongst the threads, then again nothing need be done: when the subroutine (or function) is called, all of the available threads will already be busy with their own work, so any further potential parallelism cannot be exploited.



      subroutine sub1
      parameter(noofthreads=4)
      <subroutine header>
      integer myid,nthreads,GET_THREAD_ID
      nthreads=noofthreads
      myid=GET_THREAD_ID ( )
      if (myid.ne.0) goto 9999
      nthreads=1
      <subroutine body>
 9999 call BARRIER ( )
      return
      end

      function func1
      parameter(noofthreads=4)
      <function header>
      integer myid,nthreads,GET_THREAD_ID
      <type> copy
      nthreads=noofthreads
      myid=GET_THREAD_ID ( )
      if (myid.ne.0) goto 9999
      nthreads=1
      <function body>
      copy=<function result>
 9999 call BARRIER ( )
      func1=copy
      return
      end

Figure 4.3: Alterations to Subroutines and Functions

Conversely, if the subroutine or function does contain opportunities for exploitable parallelism, then its name should be added to a list of subroutines and functions requiring to be parallelised. The subroutine or function should also be altered (the first time it is encountered), as shown in Figure 4.3. The <type> in the figure refers to the type of the function. The advantage that this gives is that the subroutine may be called from any part of the code, from either a serial or a parallel region. Again, the goto command is required to allow the slave threads to be treated in the same way, independent of the control flow.

Once the main routine's body has been parallelised, the next step is to repeat the process for a subroutine or function from the list. Any subroutine (or function) which is now only being called from parallelised regions of code may be chosen.

3 The only time that this method may not work is if there is recursion in the program. As recursion is not supported in Fortran 77, this issue will not be dealt with here.


This method is followed repeatedly, treating each chosen subroutine the same way as the main routine, until all subroutines have been parallelised.

4.3.2 Parallelising a Single Routine or Function Body


The method of parallelising a routine body will now be described in depth. Each statement in turn must either be executed by the master thread only, or executed by all threads. In general, the rule is that statements which change the control flow of the program (such as if statements) are executed by all threads, and all other non-loop statements are executed by thread zero only. The rules for Fortran 77 loops and for subroutine and function calls are shown, in the form of a decision tree, in Figure 4.4. The transformations mentioned in the diagram are shown in Table 4.1.

It should be noted that this part of the method is intended more as a guide. There will be some loops where it would be desirable to divide the iterations among the threads, but where some data dependence, or some shared data item, prevents this from being done. In these cases, the body of the loop will have to be carefully examined. It may be possible to use locks to ensure the threads do not interfere, or to rearrange the loop in such a way that the data dependence disappears. Clearly, the issue of data dependence is far too large to be dealt with here; indeed, it is one of the main problems when trying to exploit loop parallelism (see [Sakellariou96]).

The conditional goto code block, as shown in Table 4.2, was introduced into the code body as part of Figure 4.3, where it appears at the top of the routine. After the first statement in the routine (i.e. the statement immediately succeeding the goto code block) is processed, the goto code block is moved to just after that statement. This process is repeated, so that after each subsequent statement is processed, the goto code block is moved to just below the processed statement.


Is the statement a function call, a subroutine call, or the start of a loop?

  Loop: are the loop iterations to be divided amongst the threads?
    Yes - all threads execute the loop; transform the DO statement; leave the body of the loop as it is.
    No  - all threads execute the loop; use the decision tree for the statements in the loop body.

  Subroutine: is the subroutine which is being called to be parallelised?
    Yes - all threads execute the call.
    No  - only the master thread executes the call.

  Function: is the function which is being called to be parallelised?
    No  - only the master thread executes the call.
    Yes - is the result of the call stored?
      No  - all threads execute the call.
      Yes - transform the function call (all threads execute).

Figure 4.4: Decision Tree for Individual Code Statements


Transformation: Loop where iterations are to be divided amongst the threads
  Old code:
        do 100 i=1,n
  New code:
        itile=(n+nthreads-1)/nthreads
        istart=1+myid*itile
        iend=min((myid+1)*itile,n)
        do 100 i=istart,iend

Transformation: Function called by all threads where result is stored
  Old code:
        dest=func1(...)
  New code:
        if(myid.eq.0)then
          dest=func1(...)
        else
          dummy=func1(...)
        endif

Table 4.1: Simple Code Transformations


Name: Barrier code block
  Code block:
        if(nthreads.gt.1) call BARRIER ( )

Name: Conditional routine and function exit code
  Code block:
        if(myid.ne.0) goto 9999
        nthreads=1

Table 4.2: Code Blocks used in Routine Bodies

When processing the routine body, barriers should be placed after each loop where the iterations are shared amongst the threads. A barrier should also be placed at the end of any group of statements which execute on the master thread only. For these purposes, any instances where the function transformation was used are treated as `master thread only' statements. To place a barrier, the barrier code block shown in Table 4.2 is used. Placing a barrier after each parallelised loop, and after each block of statements which are executed only by the master thread (including situations where the function transformation was used), may place more barriers than are strictly required, but the code will run without race conditions. Also, there will be opportunities for removing some of these barriers in later steps of the method (see Section 4.4).

The reason that this method works for complex control flows is due to the way in which parallelised loops are coded. Because the number of threads is used when calculating the work for each thread, the loop works whether one or many threads execute it; all that is required is that the value of nthreads is



changed when the number of threads executing the routine changes, as is done in the goto code block. As the parallel region of the routine is expanded, conditional branches may be encountered, which provide two possible paths through the code. These paths will have to be treated separately, with each path being given its own conditional goto code block to send the slave threads to the end of the routine body. The parallel region may then be advanced through each of these paths separately. Once a routine has no serial regions remaining, i.e. there is no conditional goto block sending the slave threads to the end of the routine body, the conditional part of the barrier statements may be removed.
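To pull the pieces of this subsection together, the fragment below shows what a routine body might look like part-way through the process, combining the loop transformation of Table 4.1 with the code blocks of Table 4.2. The array and variable names are invented for the illustration, and the if(myid.eq.0) guard is simply one way of restricting a statement to the master thread.

c     Parallelised loop: each thread executes its own range of
c     iterations (loop transformation of Table 4.1).
      itile=(n+nthreads-1)/nthreads
      istart=1+myid*itile
      iend=min((myid+1)*itile,n)
      do 100 i=istart,iend
        b(i)=2.0d0*a(i)
  100 continue
c     Barrier code block after the parallelised loop.
      if(nthreads.gt.1)call BARRIER ( )
c     Master-thread-only statement, followed by another barrier.
      if(myid.eq.0) sfac=b(1)
      if(nthreads.gt.1)call BARRIER ( )
c     Conditional exit code block: the slave threads leave here while
c     the remainder of the routine is still a serial region.
      if(myid.ne.0) goto 9999
      nthreads=1
c     <serial region of the routine body, still to be parallelised>
 9999 call BARRIER ( )
      return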

4.4 Progression of Data Distributions


As illustrated in Figure 4.1, after the threads and barriers implementation has been obtained, the data distribution of the program's arrays must be altered in order to produce distributed memory code. Shared scalars present in the code must also be addressed, but this will be done later. There are restrictions with this method, and it would require modification if it were used in situations where the data distribution of an array were to change during the execution of the program.

The following steps may be carried out separately for each array. Firstly, for p processors, the declaration of the array is altered so that p copies of the array are declared. This is done simply by adding another dimension (of size p) to the array. For the program to compile, an extra index must clearly be added to all statements which read or write the array. This index is calculated in such a way that only 1/p of each copy of the array is used. It is assumed that this is done by partitioning the array on the first index, dividing the array into p contiguous chunks. Indeed, it may be worth altering the program first in order to make this a suitable data distribution, as it greatly simplifies future stages.



The arithmetic for performing the calculation is similar to the arithmetic used to divide the loop iterations among the processors (see Table 4.1) but, given that loop iterations have already been distributed amongst the processors, it should be the case that thread number n - 1 accesses the nth part of the array. In these cases, myid + 1 can be used for the last index. If all accesses to an array have myid + 1 for the last index, or at least all accesses which occur after the initialisation phase of the program, then it will be easy to switch to a local numbering system, i.e. numbering each local segment of the array starting from one. But, if there are many accesses to other parts of the array, then it may be appropriate to replicate it, either completely or partially. For an example of where replication is appropriate, see Section 5.4. Depending on the decision taken, the second phase for an array - the move to a local numbering scheme - may be skipped. Clearly, when changing numbering schemes, all references to an array must be changed before the code can be run again.

The third phase is to ensure that all reads from arrays by a thread are performed from local data, and that any writes to non-local data are separated clearly from local writes. These non-local writes will become explicit communications when BSP or MPI is used. For any particular array, this involves resolving all reads which are still not local, i.e. reads which have a final index which is not myid + 1, no matter what numbering scheme is in place. Code will be added to copy the required data explicitly to the reading threads' local segment of the array, and the non-local reads which were present before will become local. If there is currently no barrier between these new writes and the reads, then clearly one must be added.

The final step for the data distribution is to resolve all uses of shared scalar variables, so that all reads will be from local data. Obviously, copies of any data items which are used to affect control flow, e.g. variables used in the deciding



comparison of an if statement, will be required on all threads. There are two options here: either the result can be copied from the master thread to the slave threads, or the computation of the result can be replicated on all threads, avoiding the need to communicate the result. An additional advantage of replicating computation is that it may be possible for some of the barriers which were placed earlier to be removed. Again, for an example, see Section 5.4.
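As a sketch of the first of these phases, the fragment below shows an array being given one copy per thread, with the loop over it indexed so that each thread touches only its own contiguous chunk of its own copy. The array name and sizes are invented, and the index arithmetic is that of Table 4.1.

c     Before: one shared array, with the loop iterations already
c     divided amongst the threads.
c         double precision x(16)
c         do 100 i=istart,iend
c           x(i)=0.0d0
c   100   continue
c
c     After the first phase: p copies of the array, each thread using
c     only its own copy (last index myid+1) and its own chunk of the
c     first index.  A later phase would switch each chunk to a local
c     numbering scheme.
      integer i,myid,nthreads,itile,istart,iend
      parameter (n=16,p=4)
      double precision x(n,p)
      itile=(n+p-1)/p
      istart=1+myid*itile
      iend=min((myid+1)*itile,n)
      do 100 i=istart,iend
        x(i,myid+1)=0.0d0
  100 continue
      if(nthreads.gt.1)call BARRIER ( )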

4.5 Summary
The development of distributed memory codes is a difficult task which has typically been performed in a single revolutionary step. As two of the three parallelising methods selected for evaluation in this thesis use a distributed memory model, a more incremental approach was sought. Following a discussion of the data distribution issues related to using distributed memory models, just such an evolutionary development method was presented. This novel approach was shown, through a series of simple steps, to evolve distributed memory model codes from serial codes. Following an overview of the method, its two main phases - the conversion of the serial code to a threads and barriers implementation, and the distribution of shared data structures - were explained in detail. As the components of the development method were explained in depth, each was shown to be evolutionary in its own right, allowing the code to be run at frequent intervals.

This development method will now be applied in the implementation of the message-passing and Bulk Synchronous Parallel (BSP) codes. These implementations are discussed, together with the development of the shared memory code, in the next chapter.

Chapter 5

Development of Codes


Following selection of the evaluation criteria and the three parallelising methods to be evaluated, the next task is to write the parallel codes. This chapter deals with different aspects of the development of the codes, and their different versions. Firstly, the N-body serial code which is to be parallelised will be introduced. To give the problem context within the field of N-body problems, its complexity relative to other N-body problems will be discussed.

After the introduction of the serial code, the details of the production of the three parallel codes will be given. Firstly, the shared memory code will be discussed, including details of the overhead analysis which was performed. Next, the BSP code will be dealt with, including a discussion of the potential for tuning this code. As the BSP code was the first distributed memory code to be written, this section will include the experiences of using the development method which was described in Chapter 4. Finally, the development of the MPI code will be examined, including the vital last step of combining communications.


5.1 Serial Code


The serial code which has been parallelised is a micromagnetics program for simulating the effects of an external magnetic force on thin film media, such as disk surfaces. For details of the code and the theory behind it, see [Miles91]. A thin film medium is made up of a two-dimensional surface, to which a number (1000 to 20000) of magnetisable grains are attached. The application of an external magnetic field changes the magnetic fields of the particles. Because the grains' magnetic fields all interact with each other, an equilibrium must be found. The program finds this equilibrium by solving a system of Ordinary Differential Equations (ODEs).

In its most common mode of operation, the code starts by applying a negative external field with a certain magnitude. The magnetic fields of the particles are then found by the above method. The value of the external field is then repeatedly changed, stepping linearly through zero, until it is positive with the same magnitude as it started with; this typically takes 400 steps. At each step, the magnetic fields of the particles are found, using their final orientation from the previous step as initial values for the current step. (This dependency on previous steps prevents an easy form of parallelism in which the magnetisation of the grains could be found for different external field values on different processors.) The overall magnetisation of the grains follows the change in field, producing one half of a hysteresis loop.

A problem which concerns a large system of objects which all exert forces on each other is called an N-body problem. N-body problems encapsulate a wide range of problems, from the code described above, to simulations of collisions between galaxies. Because of the total inter-particle interactions, solving these problems precisely would take execution times of O(N^2). If large problems are to be solved, some way of trading off precision to get better execution time is required. Intuitively, the further away two bodies are from each other, the less effect they have on each other. However, the influence which two particles have


on each other decreases slowly enough, as they are moved further apart, that their interaction may never be disregarded. Another approach is to approximate interactions with more distant particles by grouping them spatially into so-called cells, and treating these cells as single particles. The further away the particles are, the larger the size of cell that can be used. To do this systematically, the problem space can be repeatedly subdivided, giving a hierarchy of cells. As interactions with a given particle are being computed, the further away from the particle the calculation gets, the larger the size of cell that can be used to compute the interactions. Indeed, most N-body algorithms are based upon two methods which do just this, namely the seminal works of Barnes & Hut [Barnes86], and Greengard & Rokhlin [Greengard87]. In both of these methods, the desired precision can be maintained by controlling both the distance at which each level of the cell hierarchy is used, and also how accurately groups of particles are approximated into single `particles'. These methods both give execution times of O(N log N) or better. The method described by Greengard & Rokhlin, the Fast Multipole Method (FMM), tries to improve upon the Barnes-Hut method by using multipole approximations for the cells, rather than dipole approximations, and by allowing direct cell-cell interactions, as well as particle-cell and particle-particle. Indeed, the authors claim their method gives execution times of O(N), although the method has been shown to yield only O(N log N) in problems where particles are unevenly distributed [Singh93].

It is useful to consider the wide variations in complexity which are present in the field of N-body problems. Figure 5.1 shows a hierarchy of N-body problems based on the distribution of the bodies and their freedom to move, and even combine or split. Considering the micromagnetics code, it can be seen that in terms of this hierarchy of N-body problems it is relatively easy, belonging to category D.

Figure 5.1: Hierarchy of N-Body Problems
[The figure arranges four categories of problem along an axis of increasing complexity: bodies which are stationary and well distributed; bodies whose positions are fixed; bodies which are free to move in the problem space; and bodies which can move, and may split or coalesce.]

The gaps between levels in this hierarchy are emphasised when parallel implementations are considered. For instance, whether the bodies are well distributed in the problem space will make no difference to a sequential implementation, but will make partitioning the problem space onto multiple processors difficult for a parallel implementation.

In the case of the provided micromagnetics code, a Barnes-Hut type algorithm is used, but the number of levels in the cell hierarchy is limited to two, as shown in Figure 5.2. Limiting the hierarchy to two levels makes the code easier to write in Fortran 77, which does not support dynamic data structures. However, the code does not scale to large problem sizes, as eventually the level-two interactions will dominate, causing execution times to be O(N^2) for larger problem sizes. Although the grid parameters are configurable, typically there are nine level-one cells per level-two cell, with each level-two cell containing 48 magnetisable grains. As can be seen in the diagram, exact interactions are performed with all grains in the nearest 25 level-one cells. After this, there are direct interactions


with all level-one cells contained within the surrounding eight level-two cells. More distant interactions are calculated directly with level-two cells. In order that all grains should have the same number of exact interactions, and level-one and level-two cell interactions, the problem area is assumed to wrap around. This means that particles near the edges of the problem area interact with some particles and cells near the opposite edges. It is important to note that, although the rectangular area may appear to be mapped onto a torus, this is not actually the case, as all interactions are based upon a flat problem space. Due to this boundary condition, when interactions are calculated between a grain and another grain or cell, there are two possible separations in both axes. In all cases, the shortest separation for both axes is used. When calculating the sum force exerted by a cell, a simple dipole approximation is used, which is exerted from the centre of gravity of that cell.

Figure 5.2: Cell Hierarchy Used in Serial Code
[The diagram marks the target grain, the grains used for particle-particle interactions, the level-one cells used for particle-cell interactions, and the surrounding level-two cells used for particle-cell interactions.]

The problem size is set in the program by setting two integer parameters, nl2x and nl2z. These set the number of level-two cells (see Figure 5.2), the largest cell size used by the program. The problem sizes are naturally described in these terms, with `4x6' representing a problem size with nl2x set to 4, and nl2z set to 6. The smallest problem size with which the code will run is 4x4. Typically, small, square problem areas are used when running the code, such as


4x4, 5x5 and 6x6. The use of a fixed 2-level hierarchy means that, as problem sizes get larger, the execution times of the program will scale as O(N^2). However, before this becomes a problem, the method used to solve the Ordinary Differential Equations breaks down. At one point in each half of the hysteresis loop, there is an extremely short interval of catastrophic change, where almost all of the magnetisable grains completely change the orientation of their magnetic field. Because of the huge amount of change, the ODEs are hard to solve at this point. As the problem size increases, the equilibrium becomes harder to find, not only because of the number of equations in the system, but also because of local changes in the magnetic fields which are propagated by exchange effects. These local changes are like ripples left on a pond after a large disturbance: the effects move around the problem area, and, if the problem area is sufficiently large, are significant enough to prevent the equations from ever converging. To solve problems of such large sizes, the model parameters must be changed to make exchange effects less significant.
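For concreteness, the number of grains N implied by these problem sizes follows from the typical grid parameters quoted above (48 magnetisable grains per level-two cell); the formula below is my own restatement rather than one given in the thesis, and its value for the smallest problem agrees with the lower bound on ngr mentioned in the next section.

N = \mathtt{nl2x} \times \mathtt{nl2z} \times 48, \qquad \text{e.g. for the 4x4 problem } N = 4 \times 4 \times 48 = 768.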

5.2 Devectorising for Coarse Grained Parallelism


The serial code, mentioned in Section 5.1, was tuned earlier in its life in order to run efficiently on a vector processor. As mentioned in Section 2.2, older vector compilers could only produce vectorised object code for loops made up of just a few lines of code. For the three methods of parallelisation chosen, coarse-grained parallelism is required, i.e. loops with lots of work in each thread's allocated iterations, in order to minimise synchronisations. To achieve this for this particular code, loops often have to be fused. Loops may only be fused if they have the same index, and there are no data dependencies which prevent the loop bodies


from being combined in this way. As the example code shown in Figure 5.3 demonstrates, loops often did not have the same index. This example is loosely based on part of the `dmdt' routine in the micromagnetics code. This routine is responsible for roughly 80-90% of the execution time, and the original code for the entire routine can be found in Appendix A.1. Figure 5.3 shows three loop nests which add the contributions to the force acting on each grain. The first loop initialises elements of the array which will be used to contain the force data with an initial value based on the results from the previous iteration. The second loop nest adds exact contributions from neighbouring grains, and the third adds contributions from level-one cells. There are other loops to deal with level-two contributions, as well as exchange contributions (from any grains which are physically touching), which have been omitted in order to keep the example simple. To further simplify the example, only calculations for the part of the field acting in the x-dimension are shown; there exist corresponding arrays for the y-axis and z-axis contributions. The code has also been indented.

When the inner and outer `DO' loops are reversed, the code (which is still sequential at this point) is seen to take around 50% longer to run. This is due to cache effects. In the standard for Fortran 77 [ANSI78], it is stated that, in any implementation of the language, the array items must be stored in memory in such a way that elements (n,...) and (n+1,...) of any array must be adjacent to each other in memory, in ascending order. Now consider the neighbour list array, nliste, which is accessed in the middle loop nest. Because of the way the loops are written, the memory locations containing the array nliste are accessed sequentially. So, when the array is first accessed, a cache line with the first few elements of the array will be loaded into the cache. This means that subsequent accesses will use the data already contained in the cache. Data for this array will be loaded into the cache only after all the data in the last loaded line has been


      do 100 igr=1,ngr
        dhxt(igr)=dble(hxext(igr))
     +           +dble(rd31(igr))*dmx(igr)
     +           +dble(rd32(igr))*dmy(igr)
     +           +dble(rd33(igr))*dmz(igr)
  100 continue
      do 500 jgr=1,neighb
        do 400 igr=1,ngr
          dmxij=dmx(nliste(igr,jgr))
          dmyij=dmy(nliste(igr,jgr))
          dmzij=dmz(nliste(igr,jgr))
          dspij=dmxij*dble(xij(igr,jgr))
     +         +dmyij*dble(yij(igr,jgr))
     +         +dmzij*dble(zij(igr,jgr))
          dhxt(igr)=dhxt(igr)
     +             +dble(fij(igr,jgr))
     *             *(dspij*dble(xij(igr,jgr))-dmxij)
  400   continue
  500 continue
      do 1500 jl1=1,nl1dim
        do 1400 igr=1,ngr
          dmxij=dm1x(nlist1(igr,jl1))
          dmyij=dm1y(nlist1(igr,jl1))
          dmzij=dm1z(nlist1(igr,jl1))
          dspij=dmxij*dble(xij1(igr,jl1))
     +         +dmyij*dble(yij1(igr,jl1))
     +         +dmzij*dble(zij1(igr,jl1))
          dhxt(igr)=dhxt(igr)
     +             +dble(fij1(igr,jl1))
     *             *(dspij*dble(xij1(igr,jl1))-dmxij)
 1400   continue
 1500 continue

Figure 5.3: Example Vectorised Code


used. So, most times this array is accessed, the required data will already be present in the cache, which is highly efficient.

Consider now the case with outer and inner loops reversed. The new access pattern for the array's elements means that each read will cause a cache line to be read from main memory. Because the outer loop is large (ngr is at least 768), by the time the loop gets round to using one of the elements which was loaded the first time the array was accessed, it will no longer be in the cache. This means that every access to the array will now cause an access to main memory. This explains the degradation in performance. The same behaviour also occurs with the nlist1, fij1, xij, yij and zij arrays. Fortunately, this is the only place where these arrays are read. This means that the indices of the arrays can be reversed without harming performance elsewhere in the code. Obviously this has to be done everywhere the arrays are used, but this is only in the loops shown above, and in the setup phase of the program. The loops in the setup code required fusing in the same way, so the reversal of indices helped there, as well.

After reversing the loop indices, fusing the loops is a trivial task, resulting in the code shown in Figure 5.4. This version of the code saw an improvement in performance over the code from Figure 5.3, with the entire program running approximately 5% faster. As can be seen, after fusing the loops, the dhxt array is not required until after the loops, where it is accessed just once. This again helps performance, as the scalar dtmpx may be placed in a register during the loop.

This process of loop fusion has been performed mainly in this routine, and in the setup code; generally it was not required elsewhere. This reveals the logic behind the original optimisation for the vector machine, viz. of concentrating on optimising the most expensive routines. The only other place in the program


      do 1500 igr=1,ngr
        dtmpx=dble(hxext(igr))
     +       +dble(rd11(igr))*dmx(igr)
     +       +dble(rd12(igr))*dmy(igr)
     +       +dble(rd13(igr))*dmz(igr)
        do 400 jgr=1,neighb
          dmxij=dmx(nliste(jgr,igr))
          dmyij=dmy(nliste(jgr,igr))
          dmzij=dmz(nliste(jgr,igr))
          dspij=dmxij*dble(xij(jgr,igr))
     +         +dmyij*dble(yij(jgr,igr))
     +         +dmzij*dble(zij(jgr,igr))
          dtmpx=dtmpx
     +         +dble(fij(jgr,igr))
     +         *(dspij*dble(xij(jgr,igr))-dmxij)
  400   continue
        do 1400 jl1=1,nl1dim
          dmxij=dm1x(nlist1(jl1,igr))
          dmyij=dm1y(nlist1(jl1,igr))
          dmzij=dm1z(nlist1(jl1,igr))
          dspij=dmxij*dble(xij1(jl1,igr))
     +         +dmyij*dble(yij1(jl1,igr))
     +         +dmzij*dble(zij1(jl1,igr))
          dtmpx=dtmpx
     +         +dble(fij1(jl1,igr))
     +         *(dspij*dble(xij1(jl1,igr))-dmxij)
 1400   continue
        dhxt(igr)=dtmpx
 1500 continue

Figure 5.4: Example Code with Loops Reversed and Fused


where significant optimisation had been performed was the setup routine, which calculates the neighbour lists for each particle, as this contains loops which are of complexity O(N^2).
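The cache behaviour described in this section follows directly from the column-major storage order required by the Fortran 77 standard. For a two-dimensional array declared with dimensions (d1, d2), the position of element (i, j) relative to the start of the array is

\mathrm{offset}(i,j) = (i-1) + (j-1)\,d_1 ,

so stepping through the first index gives stride-one accesses that stay within the current cache line, while stepping through the second index gives accesses separated by d1 elements; this is why the array indices were reversed along with the loops.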

5.3 Shared Memory Code


This section gives an outline of the methods used to tune the shared memory version of the code. Initially, parallelising compiler directives were placed in the single routine responsible for 80-90% of the execution time of the program. This quickly produced a parallel version, to be referred to as SM-ONEP, but it was not very efficient. Next, all loops over appropriate arrays were parallelised. There was only one such loop which could not be parallelised, due to the use of a shared data structure. This loop is in the setup code, and is of complexity O(N), which, compared to the loops in the setup code which generate neighbour lists for grains (which are O(N^2)), is not significant. This produced a code, SM-ALLP, which was significantly more efficient for every problem size and <number of processors> combination, but which still did not provide good efficiency.

When attempting to improve the performance of a parallel code, it is possible to consider the performance purely in terms of execution times, attempting to reduce the execution time as far as possible using ad hoc methods. To give a realistic target for how low the execution time should go, the ideal execution time can be calculated, as stated in Section 2.3, by dividing the fastest serial execution time for that problem size by the number of processors. Using this ideal, the tuning process can be seen as an attempt to reduce the gap between the actual and ideal execution times. One approach which may help when trying to reduce this gap is to attempt to account for it precisely; this approach is known as Overhead Analysis. Here, the difference between the actual and ideal


execution times is said to be due to temporal overheads, and using some scheme of classification, this difference can be broken down into parts, and hopefully accounted for. The main hindrance in this process is that it is often difficult to measure these overheads precisely. The classification of overheads presented in [Bull96], which is novel because of its hierarchical approach, will be followed below. First, the code's temporal overhead is broken down into four categories:

Information Movement - which covers memory access and synchronisation costs;

Critical Path - which includes load imbalance, replicated computation, and overheads due to insufficient parallelism (i.e. code which is not parallelised, or is only partially parallelised);

Control of Parallelism - which is time spent in user code concerned with scheduling the parallel tasks, and any time spent in the run-time system code; and

Additional Computation - which is concerned with overheads resulting from algorithmic changes and/or implementation changes.

For the shared memory code, several of these overheads do not apply. As the only changes to the devectorised code were the addition of compiler directives, there are no instances where computation has been replicated, and no algorithmic or implementational changes have been made. The cost of unparallelised code and load imbalance are relevant, as are memory access overheads and synchronisation costs, the last of which covers the cost of using a loop-parallelising compiler directive. The only `Control of Parallelism' overhead would be time spent executing code generated by the compiler to perform the synchronising of the parallel


loops, but this has already been accounted for under the heading of synchronisation costs. This leaves us with just the following four relevant overheads:

1. the cost of unparallelised code,

2. the cost of using loop-parallelising compiler directives,

3. the cost due to load imbalance, and

4. memory access overheads.

The first step is to measure the time spent in unparallelised code, as time spent in unparallelised code is pure overhead. In order to measure this accurately, the code must be run on a single processor (so that remote access costs do not exaggerate the overhead), and the time spent executing serial sections of the code (i.e. everything except parallelised loops) measured. The overhead due to the startup and shutdown (synchronisation) costs of executing parallel loops can also be calculated by counting the number of parallel loops executed, and multiplying this number by the cost for a single loop. The single-loop costs have been measured, and are set out for the SGI and the KSR1 in Tables 3.2 and 3.3, respectively.

To look at load imbalance, the Event Logger tool (ELOG) on the KSR1 can be used. This tool works via simple calls from each thread of a program which register when a particular point in the code has been reached. Each call uses a number to signify a particular event, and user-defined events may be added. By placing calls to register when each thread is starting and finishing a parallelised loop, the length of time that each thread takes to execute the loop may be compared. If the threads do not take the same length of time to execute each loop, then there is load imbalance. To identify precisely where the imbalance is happening, additional events may be defined which signify when a particular


subroutine is entered or returned from. To assimilate the timing data generated by ELOG manually would be an enormous task, so a visualisation tool, `Gist', is also provided. A typical output from `Gist' is shown in Figure 5.5, with the length of time between certain events indicated by raised areas. By using these tools with appropriately defined events, load imbalances can easily be seen.

Once load imbalance is detected, and its source located, the next step is to eliminate, or at least attempt to reduce, the imbalance. In the case of the loop-parallelising directives on both the SGI Challenge and the KSR1, the default action is to divide the iterations evenly between the threads by allocating each thread a contiguous chunk of iterations. This strategy will provide a balanced distribution of work provided that each group of iterations requires the same amount of work as the others. In cases where it does not, a more detailed study of the loop will be required in order to work out a better distribution of iterations. For example, in the code shown in Figure 5.6, the work required to execute the inner loop increases linearly for each subsequent iteration of the outer loop. One approach would be to calculate how many iterations of the outer loop each processor should perform in order to allocate each processor an equal amount of work. A simpler solution is to interleave the iterations of the loop onto different processors, so that for p processors, the first processor would be allocated iterations 1, p + 1, 2p + 1, etc. (a sketch of this interleaved allocation is given below). The second method has the advantage of simplicity, but, while providing good load balance, it will result in poor use of cache, as all processors will read and write elements of the array a which will be on the same cache line as each other. This will cause the other processors' copies of the cache line to be invalidated, even though the processors never attempt to access the same elements of a. This phenomenon is known as false sharing, and impacts severely on performance. The occurrence of false sharing when using interleaving schemes is covered in [Lilja94], along with other issues concerning load imbalance.
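The sketch below (not taken from the thesis code) shows the interleaved allocation just described, applied to the loop of Figure 5.6; myid is assumed to be the thread number counted from zero, and nthreads the number of threads.

      subroutine cyc(a,b,nsize,myid,nthreads)
c     Illustrative sketch only: cyclic (interleaved) allocation of the
c     iterations of the Figure 5.6 loop.  Thread myid executes
c     iterations myid+1, myid+1+nthreads, myid+1+2*nthreads, and so on.
      integer nsize,myid,nthreads,i,j
      double precision a(nsize),b(nsize),dtemp
      do 100 i=myid+1,nsize,nthreads
        dtemp=0.0d0
        do 200 j=1,i
          dtemp=dtemp+a(i)*b(j)
  200   continue
        a(i)=dtemp
  100 continue
c     Neighbouring elements of a are now written by different threads,
c     so writes from different threads often fall on the same cache
c     line: this is the false sharing effect discussed in the text.
      return
      end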


Figure 5.5: Example `Gist' Visualisation Tool Display



      do 100 i=1,nsize
        dtemp=0.0d0
        do 200 j=1,i
          dtemp=dtemp+a(i)*b(j)
  200   continue
        a(i)=dtemp
  100 continue


Figure 5.6: Fortran Loop with Iterations of Differing Costs

The above example shows how difficult load imbalance can be to eliminate, even in simple cases where the amount of work required to process each iteration of the loop is clear at compile-time. As well as using the ELOG and `Gist' tools to observe changes in load balance, it is also possible to measure load imbalance quantitatively. For a particular loop executed on p threads, each thread will ideally take 1/p of the total time taken by all threads to execute the loop. The load imbalance is the difference between the longest time taken by any thread to execute the loop and this ideal, which is the mean of the threads' execution times. If the time taken to execute the loop on a particular processor, i, is denoted by T_loop(i), then the load imbalance overhead on p processors, T_imbalance(p), is defined thus:

T_{\mathrm{imbalance}}(p) \;=\; \max_{1 \le i \le p} T_{\mathrm{loop}}(i) \;-\; \frac{1}{p} \sum_{i=1}^{p} T_{\mathrm{loop}}(i) \qquad (5.1)
To account for load imbalance numerically in a particular parallelised loop, the time taken to execute the loop on each thread must be measured. This is possible on the KSR1, because of the use of the `tile' and `end tile' statements to denote a parallel part of the program. Statements may be placed between these directives but outside the loop which is being parallelised, so that the execution time of the loop can be accurately measured on each thread. However, on the SGI Challenge, the next line of code following the `$DOACROSS' directive must be the `DO' statement of the loop to be parallelised. The end of the loop is detected


automatically, and so only iterations of the parallelised loop may be executed on every thread, making calls to high-resolution timers impossible for slave threads.

To measure memory access costs accurately, hardware support is required in the form of both counters, to count key events such as subcache and page misses, and also timers, to measure quantities of interest such as the length of time a processor spends waiting for memory accesses (known as the stall time). Such facilities exist on the KSR1, and together with the system calls which access the information, form the PMON (Performance MONitor) utility. These measurements are useful when trying to improve the memory access patterns of a program, as they can be used to show whether a change to the code results in improved or worsened memory access. Both improvements to spatial locality, and the devising of better schemes for data distribution, should improve memory access patterns, and the PMON statistics can be used to show this, by measuring stall time for the master thread on different versions of the code. The main scope for improving memory access times is to improve on the use of cache. Reorganising arrays so that they are accessed sequentially will almost certainly decrease stall time, although this may not be worthwhile if the arrays must be redistributed during the running of the program, as this will give us a new overhead due to implementation changes.

As with load imbalance, the PMON measurements can be used to provide a quantitative measurement of memory access overhead. The first step is to measure the stall time for a given loop on a single-processor run of the code. If this is denoted T_seqstall, then, ideally, each processor on a p processor run will only stall for T_seqstall/p. If the stall time for executing the loop is measured on each processor as T_parstall(i), then the memory access overhead for the loop is the difference between the largest parallel stall time and the ideal, i.e.:

T_{\mathrm{overstall}}(p) \;=\; \max_{1 \le i \le p} T_{\mathrm{parstall}}(i) \;-\; \frac{T_{\mathrm{seqstall}}}{p} \qquad (5.2)
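Putting the quantities of this section together gives a compact statement of the tuning target; the notation and the additive form below are my own summary of the four overheads listed above, not a formula taken from the thesis.

T_{\mathrm{ideal}}(p) = \frac{T_{\mathrm{serial}}}{p}, \qquad T_{\mathrm{overhead}}(p) = T_{\mathrm{actual}}(p) - T_{\mathrm{ideal}}(p) \;\approx\; T_{\mathrm{unpar}} + N_{\mathrm{loops}}\, t_{\mathrm{loop}}(p) + T_{\mathrm{imbalance}}(p) + T_{\mathrm{overstall}}(p)

Here T_serial is the fastest serial execution time for the problem size, T_actual(p) the measured parallel time on p processors, T_unpar the time spent in unparallelised code, and N_loops t_loop(p) the number of parallel loops executed multiplied by the single-loop cost from Tables 3.2 and 3.3. Superlinear speedup, discussed next, corresponds to T_overhead(p) < 0.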


As mentioned in Section 2.3, it is sometimes possible for superlinear speedup to occur. This term refers to the situation where the actual execution time of a code is faster than the `ideal' execution time. For this to happen, the total temporal overhead of the code must be negative. As reasoned earlier, in the case of the shared memory code, there are only the four overheads which may contribute to the temporal overhead. Of these, clearly unparallelised code and synchronisation costs will always be positive, and load imbalance can at best be zero (see Equation 5.1). So, for superlinear speedup to be encountered, the memory access overhead must be negative, and of sufficient magnitude to cancel the effects of the other three overheads. This unusual situation will usually be due to cache effects. When running a code on two processors, there is twice as much cache available (and on a distributed memory machine, twice as much local memory available) for the program to use. This may lead to the entire code fitting in local memory where previously it exceeded the available space, relying on the virtual memory system to swap parts of the problem in and out of the physical memory. The elimination of this swapping results in the avoidance of costly page misses, and may well provide superlinear speedup.

The purpose of tuning the shared memory code is to increase the performance of the code. To achieve improvements, both the load imbalance and the memory access overheads of the code will be examined, and if possible, reduced. As mentioned above, the precise measurement of load imbalance is not necessary in order to improve it; and the same applies for memory access overheads. Here, the ELOG, Gist and PMON tools will be utilised in order to analyse the code's overheads, but the overheads themselves will not be quantified. A more numerical treatment of overhead analysis can be found in [Riley96].

The overhead analysis phase will conclude the development of the shared memory code. Although the load imbalance and memory access analyses can only


be performed on the KSR1, it is expected that any performance improvements on that system will be reflected by similar improvements on the SGI Challenge. The code arrived at after the analysis using ELOG and Gist will be known as SM-ELOG, and the final code produced after memory overhead analysis will be known as SM-PMON.

5.4 Bulk Synchronous Parallel Code


The initial Bulk Synchronous Parallel (BSP) code was developed from the serial code using a distributed `threads and barriers' implementation as an intermediary. This intermediate code was developed using the method described in Chapter 4. As stated in Section 3.4, the destination platforms were the SGI Challenge and the Cray T3D. As the development method requires a shared memory machine, all work on the threads and barriers implementation was performed on the SGI Challenge.

As this was the first distributed code to be developed, it was here that the data distribution was fixed. The easiest way of partitioning data onto several threads was to divide the problem area up in terms of level-two cells, distributing level-one and individual grain data based on membership of level-two cells. Although this may cause load imbalance for certain problem size and <number of processor> combinations, to divide the data any other way would greatly complicate the communication patterns of the code. This choice led to a reduction in the number of places in the program where code must be added to handle communications.

The threads and barriers code was developed using the incremental development method. Following Section 4.4, data structures were distributed according to the scheme mentioned above. Of particular interest were three arrays, dmx, dmy and dmz, which are used when calculating the exact interactions with nearby grains. Clearly, with any distribution of grains to threads, data about grains


from other threads will be required, even if data is not required from all threads. In this case, it was decided to retain the original numbering. This means that data from other threads is written into the appropriate places in these arrays. This helps to simplify the code, because it allows the loop calculating the exact interactions to simply read from the arrays as normal; the required data will be present. Obviously, there is a trade-off between complexity of code and memory usage: the larger the problem, the more empty space will be used. But, in this case, these arrays represented a small part of the total data for any problem size, so the decision was found to be sound.

As described in Section 4.4, the final step of altering the data distribution is to resolve all uses of shared scalar variables. One possibility here is to replicate computation in order to avoid further communications. In this case, the computations between parallelised loops constituted a small proportion of the entire run-time, and the shared scalars represented a constant amount of data, the same for any problem size. For these reasons, all computation undertaken by the master thread (after the initial setup phase) was duplicated on all threads. This resulted in approximately 70% of the barriers which were placed earlier being removed.

As mentioned in Section 4.2, all that was required after this to create the Bulk Synchronous Parallel (BSP) version of the code was to replace the explicit remote writes with calls to perform remote communication, and barrier synchronisation with calls to end the current superstep. This process resulted in the initial version of the BSP code.

One of the key novel ideas of BSP, as presented in [Valiant90], is that communications can be scheduled so that they overlap periods of local computation. To combine communications in the user's code would be to restrict the choices of the library as to when the communications could be performed, and could potentially


damage performance. Another attractive aspect of BSP is that direct communications can be performed without necessitating the use of buffering, which is precisely what combining communications means introducing. Because of this, it is not appropriate to alter the communication patterns of the code further. However, there are several settings for the BSP library which can be set at run-time by the initialising call to the library. In order to improve the efficiency and scalability of the code, these parameters may be adjusted. Of these, only increasing the number of buffers used per process for incoming messages helped execution times. The best increase was obtained by setting the number of buffers to the number of processors. The BSP code, as initially run with the library's default settings, except for an increased buffer size to prevent run-time overflow errors, will be referred to as BSP-FIRST. Running the BSP code with the improved library settings will be referred to as BSP-TUNED.
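To make the conversion described in this section concrete, the sketch below shows how a remote write followed by a barrier in the threads and barriers code becomes a BSP put followed by the end of the current superstep. Only the bsphpput and bspsync calls are taken from the BSP library as used in Chapter 6; the variable names, the destination process idest and the eight-byte double precision offsets are assumptions made purely for illustration.

c     Illustrative sketch only.  In the threads and barriers code, a
c     thread wrote the value dlocal directly into another thread's
c     segment of a shared array and then waited at a barrier.  In the
c     BSP code the same update becomes a put into the registered array
c     rbuf on process idest, and the barrier becomes the end of the
c     current superstep.
      call bsphpput(idest,dlocal,rbuf,8*myid,8)
      call bspsync()
c     Once bspsync() has returned, every process may safely read the
c     data written into rbuf during the superstep which has just ended.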

5.5 Message Passing Interface Code


As mentioned in Section 4.2, the simplest way to obtain a Message Passing Interface (MPI) code is to take the BSP code and alter it to make the communications two-sided. As stated in Section 3.3.1, the MPI standard contains higher-level calls to perform reduction operations, as well as commonly used broadcast patterns, scatters, gathers, etc. This enables some of the routines from the BSP code to be simplified, as the BSP library does not contain any such high-level routines, with even simple reductions requiring manual handling of all communications. This led to the first version of the MPI code, which will be referred to as MPI-FIRST.

Unlike BSP, with MPI the user is completely responsible for communication patterns. As sending a message to another processor may have a high network startup cost associated with it (see [Foster95], Section 3.3.1), the next step was


to combine communications wherever possible. The only place in the main code where there are sufficient communications to warrant this approach is in the subroutine `dmdt'. To achieve the combining of communications, new data structures had to be introduced. The packing and unpacking of the data is performed in separate routines, to allow `dmdt' to remain as readable as possible (these routines are listed in Appendix A.5, after the code for `dmdt' itself). Although the number of communications was reduced greatly, there was additional computation performed in order to pack and unpack the data; in the scheme of overheads presented in [Bull96], this is counted as overhead due to implementational changes. This new version of the MPI code will be referred to as MPI-COMMS.
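A minimal sketch of the message-combining step is given below. Only the array names dmx, dmy and dmz and the standard MPI calls are taken from the thesis; the packing buffer, send list and message parameters (buf, isend, nsend, idest, itag), and the use of double precision data, are assumptions made purely for illustration.

      subroutine sendgr(dmx,dmy,dmz,isend,nsend,buf,idest,itag)
c     Illustrative sketch only: rather than sending the dmx, dmy and
c     dmz values for the required grains as separate messages, they are
c     packed into one buffer and sent in a single message, so that the
c     per-message network startup cost is paid once instead of three
c     (or more) times.  The receiving process posts a matching mpi_recv
c     and unpacks the buffer into its own copies of the arrays.
      include 'mpif.h'
      integer nsend,isend(nsend),idest,itag,ierr,i
      double precision dmx(*),dmy(*),dmz(*),buf(*)
      do 10 i=1,nsend
        buf(3*i-2)=dmx(isend(i))
        buf(3*i-1)=dmy(isend(i))
        buf(3*i)  =dmz(isend(i))
   10 continue
      call mpi_send(buf,3*nsend,MPI_DOUBLE_PRECISION,idest,itag,
     ,              MPI_COMM_WORLD,ierr)
      return
      end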

5.6 Summary
With the methods to be evaluated, and the criteria by which to judge them, chosen, the development of the codes could now proceed. The serial code which was parallelised three times was introduced, and its basic function explained. The development of each code was dealt with in turn: shared memory, including details of the overhead analysis which was performed as part of the tuning process; BSP, including points of interest resulting from the use of the development method from Chapter 4; and finally MPI. In all cases, details of what tuning was performed were given, and the resulting code versions were all given systematic names. With the codes fully developed, they may now be measured according to the evaluation criteria. These measurements can then be turned into results, which will be presented in the next chapter.

Chapter 6 Results
With the codes developed and tuned, the final versions of the codes are now available to be analysed for the maintainability criterion. The measured execution times can be used to calculate efficiency values for the performance results, and these can be combined with information in the development diary to produce graphs for the `ease of use' results. All of these results are presented in this chapter. Firstly, the maintainability of the final codes will be discussed. Readability will be considered first, with the software metrics being used to reinforce the arguments made. A simple example code will be used to help discuss development potential, by considering how easy each version of the code is to alter. The portability of the final codes will also be discussed, with the reasons behind any lack of portability being carefully considered. Following this, the performance results will be presented for each platform, with inter-method comparisons being made where possible. The results from the overhead analysis proposed in Section 5.3 will also be presented here. The last set of results to be presented will be the ease of use results, after which the chapter will be summarised.


6.1 Maintainability
In Section 2.2, the three components of maintainability considered to be vital were stated as being readability, development potential, and portability. These components are dealt with in turn, and then the results are summarised.

6.1.1 Readability
As stated in Section 3.2, the shared memory code was programmed using compiler directives embedded in comments which were inserted into the program. These directives did have a negative effect on readability, as can be seen by looking at the code in Appendix A.3, in that they can be large, breaking up the structure of the code. Despite this negative effect, it is clear from the devectorised and shared memory versions of the subroutine `dmdt', which are given in Appendices A.2 and A.3 respectively, that the only differences between the two are these directives, making the shared memory code easily readable to anyone familiar with the devectorised code. Although the devectorised code does represent a considerable change from the serial code, this represents the starting point for all of the parallel codes, and so similarity to this code is still important, as this is as close to the serial code as the parallel codes can get.

Examining the BSP and MPI codes, given in Appendices A.4 and A.5, respectively, it is apparent that there are many more changes to the devectorised source. These changes are mainly due to the requirement for communications to be coded explicitly. These observations are backed up by the software metrics given in Table 6.1, which clearly show the shared memory code to be overall the shortest final code, and the one containing the least changes from the devectorised code, and hence the least changes from the serial code. (An interesting observation from the table is that the devectorised code is shorter than the original code, suggesting that it should be more readable.) The other important

Code Version     Printable    Source Lines   New Data     Modified Data Structures
                 Characters   Modified       Structures   Count   Description
Original         140986       N/A            N/A          N/A     N/A
Devectorised     140714       229            0            20      Array Indices Reversed
Shared Memory    160060       536            0            0       N/A
BSP              164744       1938           2            40      Arrays Distributed
MPI              166215       1901           6            40      Arrays Distributed


Table 6.1: Software Metrics for Final Codes

factor to consider is that the shared memory code requires no data structures to be either added or modified, whereas both the BSP and the MPI codes require both. Following the logic from Section 2.2, these factors lead us to the conclusion that the shared memory code is the most readable.

In order to separate BSP and MPI, the number of data structures which were added must be considered. As stated in Section 2.2, these are important because they represent an algorithmic change to the program. Although alterations to data structures may sometimes represent algorithmic changes, here the data structure alterations only represent the distribution of the arrays. Here the MPI code required the largest number of new data structures, four more than BSP. The two new data structures required by both codes are required to keep track of which grain and level-one cell details must be sent where. Of the four additional structures required by MPI, two were required to keep track of incoming grain and level-one cell information (because of the two-sided messaging), and the remaining two were required for the buffering of messages. Like most of the changes to the code, these new data structures are all concerned with communications. Due to the extra data structures required by MPI, the conclusion here must be that the BSP code is more readable than the MPI code.


6.1.2 Development Potential


In order to evaluate development potential for a code, a modification to a code must be considered. No modifications were required to be made to the code which had been parallelised, and there was no perceived change which could be made to the code that would be simple enough to present and discuss fully. For these reasons, an example code, which computes the mean of an array of integers, is presented in three parallelised forms: shared memory in Figure 6.1, BSP in Figure 6.2 and MPI in Figure 6.3. (It is interesting to note that this example confirms the readability results.) It is possible that the user, a non-expert in parallel computing, would want to modify this routine to also calculate and display the standard deviation of the array, which, if the array is n elements large, is given by the equation:

SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \qquad (6.1)

where \bar{x} is the mean of the array. There is an alternative way of calculating the standard deviation, by calculating the sum of the squares of the x_i. This method has the advantage of requiring only one loop over the array, as it does not require the prior evaluation of the mean of the array. However, this method is far more prone to overflow, and loss of accuracy, due to the potential size of the sum of the squares, and so will not be used.

The modification will now be considered for each method. In the case of shared memory, it is possible for the non-expert to simply add another loop to the code to evaluate the standard deviation, in the same way that they could add a loop to the serial code. While the new loop would not run in parallel, it would still run and return the correct answer. If this section of the code is not performance critical, it may be possible to leave the code as it is. Alternatively, if the performance of this section of the code is important, and the user cannot work out what the new directive should contain, then the expert will


      subroutine stats(items,nitems)
      dimension items(nitems)
      integer items,nitems
      integer i,total
      real mean
      total=0
C$DOACROSS LOCAL(i),SHARE(nitems,items),REDUCTION(total)
      do 10 i=1,nitems
        total=total+items(i)
   10 continue
      mean=real(total)/real(nitems)
      print*,'Mean = ',mean
      return
      end

Figure 6.1: First Shared Memory Version of Example Code


      subroutine stats(items,nitems)
      dimension items(nitems)
      integer items,nitems
      integer i,total,redtotal
      integer myid,nthreads,itile,idist,bsppid
      dimension redtotal(4)
      real mean
      nthreads=4
      call bsppushregister(redtotal,nthreads*4)
      call bspsync()
      myid=bsppid()
      itile=(nitems+nthreads-1)/nthreads
      idist=min((myid+1)*itile,nitems)-myid*itile
      total=0
      do 10 i=1,idist
        total=total+items(i)
   10 continue
      call bsphpput(0,total,redtotal,4*myid,4)
      call bspsync()
      if(myid.eq.0) then
        do 20 i=2,nthreads
          total=total+redtotal(i)
   20   continue
        mean=real(total)/real(nitems)
        print*,'Mean = ',mean
      endif
      return
      end

Figure 6.2: First BSP Version of Example Code


      subroutine stats(items,nitems)
      include 'mpif.h'
      dimension items(nitems)
      integer items,nitems
      integer i,total,mytotal
      integer myid,nthreads,itile,idist,ierr
      real mean
      call mpi_comm_size(MPI_COMM_WORLD,nthreads,ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,myid,ierr)
      itile=(nitems+nthreads-1)/nthreads
      idist=min((myid+1)*itile,nitems)-myid*itile
      mytotal=0
      do 10 i=1,idist
        mytotal=mytotal+items(i)
   10 continue
      call mpi_reduce(mytotal,total,1,MPI_INTEGER,MPI_SUM,
     ,                0,MPI_COMM_WORLD,ierr)
      if(myid.eq.0) then
        mean=real(total)/real(nitems)
        print*,'Mean = ',mean
      endif
      return
      end

Figure 6.3: First MPI Version of Example Code

      subroutine stats(items,nitems)
      dimension items(nitems)
      integer items,nitems
      integer i,total
      real mean,stddev
      total=0
C$DOACROSS LOCAL(i),SHARE(nitems,items),REDUCTION(total)
      do 10 i=1,nitems
        total=total+items(i)
   10 continue
      mean=real(total)/real(nitems)
      print*,'Mean = ',mean
      total=0
C$DOACROSS LOCAL(i),SHARE(nitems,items),REDUCTION(total)
      do 20 i=1,nitems
        total=total+((items(i)-mean)**2)
   20 continue
      stddev=sqrt(real(total)/real(nitems))
      print*,'Standard Deviation = ',stddev
      return
      end


Figure 6.4: Altered Shared Memory Version of Example Code

be required to supply the new directive, resulting in the code shown in Figure 6.4. The fact that the modification can be made so easily by the non-expert is a direct result of the contiguous memory map that the shared memory system provides, hiding all interprocessor communication from the user.

In the case of BSP, the non-expert cannot simply add a loop which will execute on only one processor, as no one processor has access to all of the data. Instead, all the processes must be made to co-operate in the same way as when working out the mean. It might be possible for the non-expert to cut-and-paste the part

      subroutine stats(items,nitems)
      dimension items(nitems)
      integer items,nitems
      integer i,total,redtotal
      integer myid,nthreads,itile,idist,bsppid
      dimension redtotal(4)
      real mean,stddev
      nthreads=4
      call bsppushregister(redtotal,nthreads*4)
      call bspsync()
      myid=bsppid()
      itile=(nitems+nthreads-1)/nthreads
      idist=min((myid+1)*itile,nitems)-myid*itile
      total=0
      do 10 i=1,idist
        total=total+items(i)
   10 continue
      do 15 i=0,nthreads-1
        if (i.ne.myid) then
          call bsphpput(i,total,redtotal,4*myid,4)
        endif
   15 continue
      call bspsync()
      do 20 i=1,nthreads
        if(i.ne.(myid+1))then
          total=total+redtotal(i)
        endif
   20 continue
      mean=real(total)/real(nitems)
      if(myid.eq.0) then
        print*,'Mean = ',mean
      endif
      total=0
      do 30 i=1,idist
        total=total+((items(i)-mean)**2)
   30 continue
      call bsphpput(0,total,redtotal,4*myid,4)
      call bspsync()
      if(myid.eq.0) then
        do 40 i=2,nthreads
          total=total+redtotal(i)
   40   continue
        stddev=sqrt(real(total)/real(nitems))
        print*,'Standard Deviation = ',stddev
      endif
      return
      end

Figure 6.5: Altered BSP Version of Example Code

of the subroutine which evaluates the mean, and alter the copied loop to calculate the standard deviation instead, but there is a further complication. In the code given in Figure 6.2, the value of the mean is only calculated on one processor. To expect the non-expert to recognise this fact, to alter the communication structure to send the data to all processes, and to alter the code to perform the reduction on all processes, is unreasonable. This becomes clearer when examining the final code, given in Figure 6.5.

With MPI, the same issues apply which applied to BSP: the data distribution means that a single process cannot evaluate the standard deviation, so the processes must co-operate; and the code in Figure 6.3 only calculates the mean on one process. However, when examining the final code in Figure 6.6, it is evident that to alter the code to perform the evaluation of the mean on all processes is a


trivial change as far as the user is concerned, due to the extensive variations on standard calls which are provided by the MPI library. Although it is still a lot to expect the non-expert to do, it is far less work than is required to alter the BSP code correctly.

The single contiguous memory map that is provided by a shared memory system is a large advantage here, in that it allows the non-expert to modify code without any knowledge of parallel programming, meaning that shared memory codes yield the highest development potential. The only real difference between the development potential of the BSP and MPI codes was due to the current lack of functionality in the BSP library, rather than an inherent feature of the programming method. For this reason, it is not possible to conclude that MPI codes are easier for the non-expert to modify than BSP codes.

An issue for which this result has implications is the maintenance of multiple versions of the code. From Table 6.2, the improvement provided by the devectorised code is considerable. However, despite this potential for improvement, the serial code introduced in Section 5.1 was also being run on a workstation as well as a vector processor. The reason for doing this was that to maintain two separate versions of the code would be too expensive, an expense which the gain in performance did not justify. Consider the case where a serial version of the code has been retained by the non-expert for running on a workstation. This may be necessary for one of two reasons:

1. it may be needed to run small problems on, in the cases where time on the parallel machine is at a premium; or

2. it may be necessary to modify and experiment with a version of the code, requiring a rapid edit-compile-test development cycle, but the parallel machine may not permit interactive use, requiring instead that parallel jobs


      subroutine stats(items,nitems)
      include 'mpif.h'
      dimension items(nitems)
      integer items,nitems
      integer i,total,mytotal
      integer myid,nthreads,itile,idist,ierr
      real mean,stddev
      call mpi_comm_size(MPI_COMM_WORLD,nthreads,ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,myid,ierr)
      itile=(nitems+nthreads-1)/nthreads
      idist=min((myid+1)*itile,nitems)-myid*itile
      mytotal=0
      do 10 i=1,idist
        mytotal=mytotal+items(i)
   10 continue
      call mpi_allreduce(mytotal,total,1,MPI_INTEGER,MPI_SUM,
     ,                   MPI_COMM_WORLD,ierr)
      mean=real(total)/real(nitems)
      if(myid.eq.0) then
        print*,'Mean = ',mean
      endif
      mytotal=0
      do 20 i=1,idist
        mytotal=mytotal+((items(i)-mean)**2)
   20 continue
      call mpi_reduce(mytotal,total,1,MPI_INTEGER,MPI_SUM,
     ,                0,MPI_COMM_WORLD,ierr)
      if(myid.eq.0) then
        stddev=sqrt(real(total)/real(nitems))
        print*,'Standard Deviation = ',stddev
      endif
      return
      end

Figure 6.6: Altered MPI Version of Example Code

are executed via a batch queue.


In the case of the shared memory code, there is no problem with multiple versions, as the parallel code is simply an annotated version of the serial code, which can be compiled, with no modifications, on a workstation. As shown above, the shared memory code can be modified by the non-expert (on the workstation), and run on the shared memory machine. As already stated, if performance is a problem, then the expert may be required to parallelise any new loops. However, with BSP and MPI, the parallel code is completely distinct from the serial code. It would be possible to obtain a copy of the required library for a network of workstations and then run a `single processor' version of the MPI or BSP code, but, as shown above, this would make modification of the code practically impossible for the non-expert, which is certainly undesirable, and more likely unacceptable. The only other possibility is for the user to modify the serial code, and for the expert to either modify the parallel implementation, or to re-parallelise the modified serial code. However, neither of these options is as satisfactory as the situation with the shared memory code: with the former, the expert must understand the code, and the change required by the user, in sufficient detail to allow the expert to modify the parallel code in the same way; with the latter, the expert must perform the whole process of parallelising the code from scratch; with the shared memory code, neither of these requirements is demanded of the expert.

6.1.3 Portability
The final aspect of maintainability to be considered was that of portability. Because BSP and MPI are both library-based implementations, the BSP and MPI codes become instantly portable to any platform which has the correct compiler, and a version of the library. This contrasts with the shared memory code, which needed a set of compiler directives for each of its target platforms. As mentioned


in Section 3.2, there is no standard for these directives, and so new directives must be added for each new platform. However, as also pointed out in Section 3.2, there have been few attempts at standardisation of the directives, and although no standard has emerged, there is no evidence to suggest that such a standard is impossible.

Until a standard does emerge, there is still work which can be done to improve the portability of shared memory codes. As noted in Section 3.2, the parallelising directives for the SGI Challenge and the KSR1 contain basically the same information. Extending this idea, it is easy to envisage an automated transformation of SGI directives to KSR1 directives. Such a scheme would not require any semantic knowledge of the code, and the only syntactic information it would require would be the location of the end of the loop which was to be parallelised, in order to place the `*ksr* end tile' directive. Such a task is by no means beyond the capabilities of a precompiler. It is interesting to note that the reverse transformation would require semantic knowledge about the program, as shared variables must be declared in the SGI Challenge directives, but are not listed in the KSR1 directives. Problems of this sort would be easier to avoid if a standard notation which was sufficiently expressive for any shared memory architecture, such as Fortran-S [Bodin93], was adopted, and precompilers supplied for each target platform.
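As a sketch of the kind of transformation envisaged, consider the parallelised loop from Figure 6.1; the comments below describe what a hypothetical precompiler would have to do, and no attempt is made to reproduce the exact KSR1 directive syntax.

c     Input accepted by the hypothetical precompiler (the directive and
c     loop are exactly those of Figure 6.1):
C$DOACROSS LOCAL(i),SHARE(nitems,items),REDUCTION(total)
      do 10 i=1,nitems
        total=total+items(i)
   10 continue
c     To emit KSR1 directives the precompiler replaces the C$DOACROSS
c     line with the corresponding `tile' directive and, having located
c     the statement which closes the loop (label 10), places the
c     matching `end tile' directive immediately after it.  No semantic
c     knowledge of the program is needed for this direction of the
c     transformation.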


Although portability is important, it is often not considered to be as important as the other issues. If a code must be run according to some constraints, it is likely that a suitable architecture will be sought out, and the code ported to this architecture. After this, the code will be happily run on this platform until the constraints change. In short, porting is a necessary but infrequent activity; it is not something which is continuously required.

6.1.4 Overview
To sum up, the shared memory code outperforms the other methods' codes significantly on the two most important aspects of maintainability, i.e. readability and development potential. Although both BSP and MPI codes are currently more portable than the shared memory code, this may well change in the future.

6.2 Performance
Before the results for performance are presented, some details about how the results were collected will be given.

6.2.1 Practical Issues


In order to compare efficiency between different methods fairly, the same kinds of results must be taken under as similar conditions as possible. For this study, measurements were taken over the same range of problem sizes on each of the target architectures, of which there were two for each parallelising method. As stated in Section 5.1, if the problem sizes get too big, then the code will not run without alterations. To make the results meaningful, a range of problem sizes was chosen which avoided the problem of having to tamper with the model parameters, and used only problem sizes in the range where the serial code was


still yielding execution times of around O(N log N). The range of problem sizes used was square sizes between 4x4 and 8x8, inclusive. When collecting results, it was found that graphs for these five problem sizes contained too much repetitive information. For this reason, results will only be given for problem sizes 4x4, 7x7 and 8x8. This way, information about the largest and smallest problem sizes is shown, as well as results for a problem size which does not easily divide onto several processors.

Initially, execution times were measured for an entire hysteresis loop. For a serial code running problem size 4x4, the execution time on the SGI Challenge is around six hours. To measure efficiencies for different problem sizes on different platforms for the three methods will take many individual executions of the codes, and so execution times of this duration are impractical. Apart from the physical time constraint, there are other practicalities, such as other users requiring use of the machines, and a limited allocation of CPU hours (as was the case for the Cray T3D). Profiling the code on the SGI Challenge machine revealed that executions of only a few iterations of the loop provide representative use of the code, apart from the setup routines, which became far more significant in such short runs. Execution times were therefore measured over four steps of the hysteresis loop.

There is also an issue concerning which four loop iterations to time. The magnetisable grains are at first given a magnetic field with a random orientation. This means that the first step in the hysteresis loop does far more work than later steps, as the grains are aligned for the first time. This first step also uses many arrays for the first time, which incurs overheads as pages must be allocated by the virtual memory system and then mapped into main memory. These overheads become especially deceptive when attempting overhead analysis on a parallel version of the code, as they can masquerade as load imbalance. To avoid this, the first two steps were discarded, and execution times were measured over steps

CHAPTER 6. RESULTS

110

three to six, inclusive. Most platforms provide a way of measuring the CPU usage of a program accurately. This is a good way of accurately measuring execution times of serial codes, as it re ects what the execution time would be on a quiet system. However, for a parallel program, there are times when some threads of the program will be idle, waiting for other threads to reach a synchronisation point in the code, or waiting for a message to arrive. Simply measuring the total CPU usage and dividing by the number of threads produces misleading results, as the stall time for each thread may be due to di erences in the load on each processor, or to unparallelised sections of code which execute on one thread only. Hence, to measure parallel execution times accurately, a high resolution timer which measures actual elapsed time must be used, on quiet processors. The requirement for a quiet machine, or at least for a quiet set of processors, is important. Of the three target platforms, this presented the most di culty on the SGI Challenge, which provides no way for reserving processors exclusively. As the machine also has only four processors, they are all in use whenever more than four user processes are running on the machine. The only solution is to use the machine when there are no other users or jobs in progress. The KSR1 has a non-standard program which allocates processors exclusively, so that no other user processes can execute on the allocated processors. This is an important advantage, as accurate results can be measured, even when the system is being used by other users. Unfortunately, it is still possible for system processes to `over ow' onto the allocated processors if the system is busy with most of the processors allocated. In the case of the Cray T3D, programs are run on an allocated processor set, which always consists of 2n processors, where n is a non-negative integer. Most aspects of system management are performed on a host machine, in this

CHAPTER 6. RESULTS

111

case a Cray Y-MP, which is responsible for processor allocation, queueing jobs for execution, loading executables onto the T3D cells, and handling I/O to the T3D. To alleviate the load on the Y-MP, a Cray J90 was used to cross-compile the executables for the T3D. This decoupling seems awkward, but it allows the processing elements of the T3D to be completely devoted to running parallel programs. The individual nodes run a minimal operating system which mainly handles memory allocation. The exclusive use of the processors allows accurate measurements of execution times to be taken, even on a busy machine. Shortly after starting to measure execution times, it became clear that, even under ideal circumstances, the results were not always accurate. On the SGI Challenge and KSR1, processing may have been interrupted by certain background system processes, providing arti cially long execution times. Even on the Cray T3D, measured execution times were sometimes too large, occasionally twice as long as they should be. The only feasible explanation for this is the network becoming saturated as another program writes large amounts of data back to the Y-MP, e ectively halting execution for a period. To avoid results being distorted, execution times were measured for three runs of the same program. On the SGI Challenge and the Cray T3D, the execution times would typically be within 0.1s of each other. When there was distortion, two of the runs would be virtually the same, with the other being signi cantly, i.e. more than 0.1s, di erent. To discount the distorted time, the fastest execution time was used. In any degenerate cases, where three signi cantly di erent results were returned, the same program was run again three more times. In the case of the KSR1, execution times were not typically as close to each other. Ideally, several of the execution times would have been measured again to remove some of the more unruly kinks from the graphs, but, due to the eventual failure of the machine, this was not possible.
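As an illustration of this protocol, a minimal sketch is given below. The wall-clock function wtime() and the routine run_hysteresis_steps() are hypothetical stand-ins rather than names from the real code; any high-resolution elapsed-time routine provided by the platform could be substituted for wtime().

      program bestofthree
      implicit none
      integer run
      double precision t0, t1, elapsed, best
      double precision wtime
      external wtime
c     Time steps three to six of the hysteresis loop three times
c     and keep the fastest run, so that an occasional distorted
c     measurement is discarded.
      best = 1.0d30
      do 10 run = 1, 3
         t0 = wtime()
         call run_hysteresis_steps(3, 6)
         t1 = wtime()
         elapsed = t1 - t0
         if (elapsed .lt. best) best = elapsed
   10 continue
      write (*, *) 'fastest elapsed time (s): ', best
      end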

    Problem      Execution Times (s)
    Size         Devectorised Code              Original Code
                 SGI      KSR1      T3D         SGI      T3D
    4x4          9.9      16.1      9.4         10.4     13.4
    7x7          34.6     62.9      25.2        36.1     48.1
    8x8          53.1     134.8     69.1        69.0     73.5

    Table 6.2: Execution Times for Serial Codes

As stated in Section 2.3, efficiency is measured by comparing an actual execution time with an ideal execution time for the same sized problem on the same number of processors. For a given problem size, ideal times are based on the execution time of the devectorised version of the code, denoted Tseq, and are generated by dividing this time by the number of processors. Execution times for the devectorised code on each of these problem sizes, on each of the target platforms, are shown in Table 6.2. To show the difference that the devectorising process made in terms of performance, the execution times for the original serial code on the SGI Challenge and Cray T3D are also listed; no measurements were available for the KSR1. The efficiency of a given code for a particular problem size and number of processors may now be calculated from its execution time by applying Equation 2.2, which states that \( E(p) = T_{\mathrm{seq}} / (p \, T_{\mathrm{par}}(p)) \).

In the following sections, the notation introduced in Chapter 5 will be used to distinguish between code versions. Here, the notation will be extended to include the platform and problem size, so, for example, the code SM-PMON running on the SGI Challenge and solving problem size 8x8 becomes SM-PMON-SGI-8x8, and so on. The performance results are now presented for each platform.
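As a worked example of this definition, taking the devectorised SGI Challenge time for problem size 4x4 from Table 6.2 and the four-processor SM-ALLP execution time from Table 6.4 (with the same rounding as in those tables):

\[
E(4) \;=\; \frac{T_{\mathrm{seq}}}{4\,T_{\mathrm{par}}(4)} \;=\; \frac{9.9}{4 \times 2.753} \;\approx\; 0.90 .
\]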


6.2.2 SGI Challenge


As described in Section 5.3, there were four versions of the shared memory code, namely SM-ONEP, SM-ALLP, SM-ELOG and SM-PMON. The execution times of these codes were measured, and efficiency was plotted against the number of processors used, for the problem sizes 4x4, 7x7 and 8x8. These graphs are shown in Figure 6.7. The execution times themselves are tabulated in Appendix B.1.1.

After the production of the SM-ALLP code, some overhead analysis was performed. The method for this analysis, as detailed in Section 5.3, was based on [Bull96]. By a process of elimination, the only overheads relevant to this code were:

1. the cost of unparallelised code,
2. the cost of using loop-parallelising compiler directives,
3. the cost due to load imbalance, and
4. memory access overheads.

Unfortunately, the SGI Challenge provides no tools for measuring or examining load imbalance or memory access overheads, so only the cost of unparallelised code and the cost of the parallelising compiler directives could be measured. The analysis was carried out on the three problem sizes 4x4, 7x7 and 8x8, using the four-processor times in each case. The amount of time spent in parallel loops was measured, along with the number of parallelised loops which were executed. These allow the amount of time spent in unparallelised code, which is pure overhead, to be calculated, as well as the overhead associated with starting and synchronising the parallel loops (using the data given in Table 3.2). When measuring the amount of time spent in unparallelised code, the parallel code was run on one processor only. This ensured that the time measured was not distorted by accesses to data held in other processors' caches, which may occur in the unparallelised regions of the code. The parallel loop overheads are shown in Table 6.3, and the overhead analysis, based on the SM-ALLP code, is shown in Table 6.4.

Figure 6.7: Efficiency Graphs for the Shared Memory Codes on the SGI Challenge (efficiency against number of processors for problem sizes 4x4, 7x7 and 8x8, comparing SM-ONEP, SM-ALLP, SM-ELOG and SM-PMON with the naive ideal).

    Measurement                    4x4      7x7      8x8
    Number of Parallel Loops       647      599      608
    Overhead per Loop (ms)         0.017    0.017    0.017
    Parallel Loop Overhead (ms)    11       10       10

    Table 6.3: Parallel Loop Overheads on the SGI Challenge


    Measurement                           4x4      7x7      8x8
    Actual Execution Time (s)             2.753    9.500    14.455
    Ideal Execution Time (s)              2.480    8.650    13.300
    Overheads (ms)                        273      850      1155
    Unparallelised Code Execution (ms)    59       59       79
    Parallel Loop Overhead (ms)           11       10       10
    Unaccounted for Overhead (ms)         203      781      1066

    Table 6.4: Overhead Analysis on the SGI Challenge
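The entries in Table 6.4 follow directly from the figures already quoted. For the 4x4 problem on four processors, for example, and with the rounding used in the tables:

\[
\text{overheads} = T_{\mathrm{par}}(4) - \tfrac{T_{\mathrm{seq}}}{4} \approx 2.753 - 2.480 = 0.273\ \mathrm{s},
\]
\[
\text{parallel loop overhead} \approx 647 \times 0.017\ \mathrm{ms} \approx 11\ \mathrm{ms}, \qquad
\text{unaccounted} \approx 273 - 59 - 11 = 203\ \mathrm{ms}.
\]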


As already mentioned, a fuller analysis was not possible on the SGI Challenge. The later versions of the code, SM-ELOG and SM-PMON, were generated as a result of the more in-depth analysis that was carried out on the KSR1, the results of which are given in Section 6.2.3.

As stated in Section 5.4, there was only one version of the BSP code, but it was run with the default settings of the library as BSP-FIRST, and with the tuned settings of the library as BSP-TUNED. As with the shared memory code, efficiencies were calculated from the measured execution times (tabulated in Appendix B.1.2) for different numbers of processors. The graphs of these measurements are shown in Figure 6.8. The BSP code performs badly, with the tuned library version running only marginally faster on four processors than on three for the 4x4 problem size. As mentioned in Section 3.5.1, the BSP implementation chosen for the SGI Challenge is inefficient, which explains the poor results. Even for the 8x8 problem size, extrapolating from the efficiency graphs suggests that the code will not scale well above four processors.

However, the BSP-TUNED code does exhibit superlinear speedup for problem size 8x8 when using two processors. From the classification of overheads presented in [Bull96], and outlined in Section 5.3, the only overhead which may be negative in this case is the memory access overhead. This must have been due to the increase in available cache, as no more main memory is available to two processors than is available to one. However, there are still two possible sources of the improvement: the increase in level-two cache from 1Mb to 2Mb, and the increase in level-one cache from 16kb to 32kb. As no hardware support is available for counting cache and subcache hits and misses, it was not possible to establish precisely what causes this superlinear speedup.

The two versions of the MPI code, MPI-FIRST and MPI-COMMS, were developed as described in Section 5.5. Execution times were measured for these codes, and efficiency values were calculated from these times. The times themselves are tabulated in Appendix B.1.3. Graphs of efficiency varying with the number of processors are shown in Figure 6.9. As with the BSP-TUNED code, superlinear speedup was achieved by the MPI-COMMS code for problem size 8x8 on one to four processors, with the marked increase in performance being observed between one and two processors. As argued for the BSP code, this increase may be due either to the increase of level-two cache from 1Mb to 2Mb, or to the increase of level-one cache from 16kb to 32kb.

Figure 6.8: Efficiency Graphs for the BSP Code on the SGI Challenge (efficiency against number of processors for problem sizes 4x4, 7x7 and 8x8, comparing BSP-FIRST and BSP-TUNED with the naive ideal).

Figure 6.9: Efficiency Graphs for the MPI Code on the SGI Challenge (efficiency against number of processors for problem sizes 4x4, 7x7 and 8x8, comparing MPI-FIRST and MPI-COMMS with the naive ideal).


Although there is no way to discover which of these is responsible for the result, in Section 6.2.4 the availability of 32kb of cache is seen to be significant on the Cray T3D, and it is therefore likely to be the significant factor on the SGI Challenge as well.

To allow a comparison of the different methods, a further set of graphs was plotted, showing the best versions of each type of code, i.e. SM-PMON, BSP-TUNED and MPI-COMMS, on the same graph. These graphs are shown in Figure 6.10. The SM-PMON code is generally slightly slower than the MPI-COMMS code, but it does perform better for problem size 4x4 on three processors. This suggests that the method of distributing grains and level-one cells in groups determined by the distribution of level-two cells creates significant load imbalance overheads under certain circumstances. This can be understood better by considering how the problem size 4x4, which contains 16 level-two cells, would be distributed over three processors. A tile size is chosen which covers the level-two cells, in this case 6. This leads to two processors handling data for 6 level-two cells each, and the third handling data for only 4. It would be possible to change the MPI and BSP codes to distribute the level-one cells and individual grains in different ratios to the level-two cells, but, as stated in Section 5.4, this would greatly complicate the code.
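In round numbers, and assuming the tile size is simply the number of level-two cells divided by the number of processors and rounded up:

\[
\left\lceil \tfrac{16}{3} \right\rceil = 6, \qquad 16 = 6 + 6 + 4,
\]

so the two most heavily loaded processors each carry \( 6 / (16/3) = 9/8 \) of the average work, a load imbalance of roughly 12.5%.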

6.2.3 KSR1
As stated in Section 5.3, the KSR1 provides tools suitable for overhead analysis. The method for analysis suggested there, based on [Bull96], was followed, and the results are presented below. The first stage of overhead analysis began after the production of the SM-ALLP code, in which all suitable loops were parallelised.

Figure 6.10: Comparative Efficiency Graphs for the SGI Challenge (efficiency against number of processors for problem sizes 4x4, 7x7 and 8x8, comparing SM-PMON, BSP-TUNED and MPI-COMMS with the naive ideal).
Of the overheads described in the classification scheme presented in [Bull96], as used in Section 5.3, only four were found to be relevant:

1. the cost of unparallelised code,
2. the cost of using loop-parallelising compiler directives,
3. the cost due to load imbalance, and
4. memory access overheads.


The analysis was based on the three problem sizes 4x4, 7x7 and 8x8, but with different numbers of processors being used for each problem size: 4 processors for 4x4, 8 for 7x7 and 16 for 8x8. These three combinations were used because there is a substantial difference between the actual and ideal execution times for each. The amount of time spent in parallel loops was measured, along with the number of parallelised loops which were executed. These allow the amount of time spent in unparallelised code, which is pure overhead, to be calculated, as well as the overhead associated with starting and synchronising the parallel loops (using the data given in Table 3.3). When measuring the amount of time spent in unparallelised code, the parallel code was run on one processor only. This ensures that the time measured is not distorted by accesses to data held in other processors' caches, which may occur in the unparallelised sections of the code. The parallel loop overheads which were measured are shown in Table 6.5. The overhead analysis based on the SM-ALLP code is shown in Table 6.6.

As can be seen from the table, some of the overhead is still unaccounted for. As stated in Section 5.3, load imbalance is another possible source of overhead: ideal execution times rely upon a perfect division of work between processors, so, if the work in a parallelised loop is not well distributed, then some processors will be idle waiting for the other processors to finish. To detect load imbalance, the ELOG and `Gist' tools were used to gather data and then visualise it.


    Measurement                    4x4      7x7      8x8
    Number of Parallel Loops       615      595      621
    Number of Processors           4        8        16
    Overhead per Loop (ms)         1.08     1.22     1.36
    Parallel Loop Overhead (ms)    664      726      845

    Table 6.5: Parallel Loop Overheads on the KSR1

    Measurement                           4x4      7x7      8x8
    Actual Execution Time (s)             5.31     9.36     10.17
    Ideal Execution Time (s)              4.01     7.86     8.43
    Overheads (ms)                        1300     1500     1740
    Unparallelised Code Execution (ms)    90       170      270
    Parallel Loop Overhead (ms)           664      726      845
    Unaccounted for Overhead (ms)         546      604      625

    Table 6.6: Overhead Analysis on the KSR1


In the analysis of the SM-ALLP code, no systematic load imbalance was found. However, a seemingly random imbalance was occurring on a regular basis, with one thread taking almost twice as long as the others. With some more events added, this was found to occur in the first three loops of the subroutine `dmdt'. On closer examination, it became apparent that this `imbalance' was actually delay caused by the virtual memory system allocating memory for local arrays which are accessed for the first time. These arrays can be seen in the code in Appendix A.3: they are dmx, dmy and dmz for the first loop, dm1x, dm1y and dm1z for the second, and dm2x, dm2y and dm2z for the third. To avoid these arrays being re-allocated, they can be moved to the common block. This is an area of a Fortran program for data structures that are shared by more than one subroutine. Such data structures are allocated static space at compile time, so the virtual memory system will allocate pages for them only once. At the same time as these arrays were moved, a visual search of the code for similarly problematic arrays was made. This turned up only one, less significant, array, which was being initialised once for every step of the hysteresis loop. This array was also moved into the common block. These changes led to the SM-ELOG code.

ELOG analysis of this code showed no large imbalances. There were still some small load imbalances but, as seen in Section 5.3, accounting for these numerically is a difficult task which is not relevant to this evaluation. As stated in Section 5.3, the only remaining source of overhead for the shared memory code would be time spent waiting for memory accesses. The PMON (Performance MONitor) tool on the KSR1 was used to measure the time spent waiting for memory accesses on the master thread's processor. This stall time is used to indicate a general improvement in the memory access overhead. Initial measurements of this statistic for the selected problem sizes are shown in the first line of Table 6.7.
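The change amounts to promoting the work arrays of `dmdt' from local, dynamically paged storage to statically allocated storage. A minimal sketch of the idea is shown below; the argument list of `dmdt' is omitted, and the dimension NG and the common-block name are placeholders rather than the names used in the real code.

      subroutine dmdt
      implicit none
      integer NG
      parameter (NG = 1024)
c     Work arrays that were previously local to dmdt (and hence
c     re-allocated by the virtual memory system on first access)
c     are given static storage by placing them in a common block,
c     so their pages are allocated only once per run.
      double precision dmx(NG), dmy(NG), dmz(NG)
      double precision dm1x(NG), dm1y(NG), dm1z(NG)
      double precision dm2x(NG), dm2y(NG), dm2z(NG)
      common /dmwork/ dmx, dmy, dmz, dm1x, dm1y, dm1z,
     &                dm2x, dm2y, dm2z
c     ... loops over dmx, dmy, dmz etc. as before ...
      return
      end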

    Version of      4x4                  7x7                  8x8
    Code            Stall     Execution  Stall     Execution  Stall     Execution
                    Time (s)  Time (s)   Time (s)  Time (s)   Time (s)  Time (s)
    SM-ELOG         2.34      5.12       3.67      8.54       5.18      8.92
    Alteration 1    2.16      4.90       4.02      8.83       5.48      9.10
    Alteration 2    2.21      5.03       3.77      8.40       5.26      8.77
    Alteration 3    2.19      4.95       4.06      8.72       5.46      9.05
    Alteration 4    2.18      4.90       4.52      9.54       5.47      8.97

    Table 6.7: PMON Measurements and Execution Times for Codes on the KSR1

To make the shared memory code more efficient, memory access patterns must be improved. Expensive memory accesses, where copies of the data exist in other processors' memory, should be avoided by ensuring that data is always accessed on the same processor. Also, cache use can be improved by reorganising arrays so that spatial locality is increased. An important part of the code is the generation of the level-one cell and level-two cell dipole forces for direct particle-cell interactions. For a level-one cell, this calculation involves contributions from all grains contained within the cell. Similarly, for a level-two cell, contributions are required from all level-one cells contained within it. Taking this information into account, two initial directions for improvements were devised:

1. loops in the ODE solver code are not over the same size of arrays as the rest of the code; change these so that each thread in the solver code accesses the elements of these arrays which are local to it;

2. arrays storing information about the magnetisable grains, the level-one cells and the level-two cells are all ordered in raster-scan fashion; reorder these so that grains 1 to 48 are contained in level-one cells 1 to 9, which are contained in level-two cell number 1.

Following from an implementation of the second of these ideas, two further directions became available:


3. alter the ordering of the level-two cells so that subdividing the problem area amongst the threads yields areas resembling the original problem area;

4. alter the division of iterations in loops over the grain and level-one cell information arrays so that all grain and level-one cell data contained within a level-two cell is processed by the same thread.

This last idea attempts to improve memory access times by sacrificing the load balance to some extent. The above codes were all implemented, and PMON data and execution times were measured for each. These figures are shown, along with similar data for the original SM-ELOG code, in Table 6.7. Note that the data for the third and fourth alterations should be judged relative to the data for the second, on which they were based. From this information, the second alteration was the only one which provided a consistent improvement for all problem sizes and numbers of processors. While none of the other alterations served to damage execution times, the decision was made to use the second alteration as the final version of the code, SM-PMON, as this version provided the same performance with fewer changes to the code.

The execution times of the four shared memory codes SM-ONEP, SM-ALLP, SM-ELOG and SM-PMON were measured on one to twenty-four processors, for the problem sizes 4x4, 7x7 and 8x8. From these execution times, which are tabulated in Appendix B.2.1, the efficiency of the codes was calculated, and then, for each problem size, efficiency was plotted against the number of processors. These graphs are shown in Figure 6.11.

From these graphs, superlinear speedup can be seen (when efficiency is greater than one) for problem size 7x7 on 2 to 3 processors, and more noticeably for problem size 8x8 on 2 to 12 processors. For these problem sizes, it is clear that the code's efficiency increases sharply between one and two processors, for all versions of the code except SM-ONEP.

Figure 6.11: Efficiency Graphs for the Shared Memory Codes on the KSR1 (efficiency against number of processors for problem sizes 4x4, 7x7 and 8x8, comparing SM-ONEP, SM-ALLP, SM-ELOG and SM-PMON with the naive ideal).


In some cases, the increase in efficiency is large enough to yield an efficiency of more than one. As stated in Section 5.3, superlinear speedup occurs when the overall sum of overheads is negative, and the only overhead which can be negative is the memory access cost. In this case, the memory required by problems of size 7x7 and 8x8 is too large to hold in the level-two cache (main local memory) of one processor, i.e. 32Mb, of which around 25Mb is available for user code. The jump in performance occurs between one and two processors because the memory required by both the 7x7 and 8x8 problem sizes is small enough to fit into the level-two cache of two processors. As discussed in Section 5.3, the improvement is large because page misses are all but eliminated from the code. As the number of processors increases beyond two, the other temporal overheads increase, eventually becoming too large for the negative memory access overhead to cancel them out, causing the code's efficiency to fall back below one.

The number of page misses is just one of the statistics measured by the PMON tool on the KSR1. To verify this explanation of superlinear speedup, the numbers of page misses which occur on the master thread's processor when running on one, two, three and four processors are shown in Table 6.8 for the final version of the code on problem sizes 4x4, 7x7 and 8x8. The memory required by the 4x4 problem size is less than a quarter of that of the 8x8 problem, and so fits easily into the level-two cache of one processor. For this reason, no significant number of page misses occurs for that problem size on one processor, and there is no superlinear speedup.

6.2.4 Cray T3D


As stated in Chapter 3, the BSP and MPI codes could be run on the Cray T3D, but not the shared memory code. First, the results for the BSP code are presented, using the default library settings (BSP-FIRST) and the tuned library settings (BSP-TUNED).

    Number of      Problem Sizes
    Processors     4x4      7x7      8x8
    1              9        12041    30246
    2              3        391      401
    3              6        16       48
    4              18       13       3

    Table 6.8: Page Miss Statistics for the Shared Memory Code on the KSR1

Execution times were measured for both of these `versions', and are tabulated in Appendix B.3.1. These times were used to calculate the efficiency of the code with each of the library settings, and graphs of efficiency versus the number of processors were plotted. These graphs are shown in Figure 6.12. As can be seen from these graphs, there were no times for running the BSP code on two processors for problem size 8x8. This was due to memory overflow (as stated in Section 3.5.3, on the T3D the program and its data must fit within the local memory of the node). As the BSP code ran on one processor, this result seems contradictory. However, unlike the two-processor version of the code, the single-processor version makes no BSP communications, as no thread ever uses BSP library calls to communicate data to itself. This suggests that the BSP library is dynamically allocating large amounts of memory, which causes the two-processor version of the code to overflow the available memory. This is particularly disappointing, as the BSP library insists that communication areas are given a fixed size so that they may be allocated once at the start of the program, which implies that little or no dynamic memory allocation should be required. Hopefully, future versions of the BSP library will not be so expensive.

As on the SGI Challenge, the BSP-TUNED code exhibited superlinear speedup for problem size 8x8.

Figure 6.12: Efficiency Graphs for the BSP Codes on the Cray T3D (efficiency against number of processors for problem sizes 4x4, 7x7 and 8x8, comparing BSP-FIRST and BSP-TUNED with the naive ideal).


Whereas the superlinear speedup was observed on the SGI Challenge for two processors, here it was observed for four to sixteen processors. Unlike the superlinear speedup experienced on the KSR1 for the shared memory code, on the T3D the improvement cannot be due to eliminating page misses, as the T3D does not swap to disk. Again using the classification of overheads presented in [Bull96] and outlined in Section 5.3, the only overhead which may be negative in this case is the memory access overhead. On the T3D, the only memory resources which increase are the local memory and the cache. As the local memory was ruled out, this leaves the increase in total available cache from 8kb (one processor) to 32kb (four processors). This could affect performance if some frequently used part of the problem data fits within 32kb, but not 8kb. Unfortunately, there is no support on the T3D for measuring statistics such as cache hits and misses, so there was no way to verify this hypothesis directly. The only way to proceed further would be to measure the memory usage of important routines, such as `dmdt', which is responsible for 80% of the execution time; it may be the working set of this routine which fits into the 32kb of cache. Investigating this accurately was judged to be insufficiently relevant to the purpose of this thesis.

The two versions of the MPI code, MPI-FIRST and MPI-COMMS, were run for the problem sizes 4x4, 7x7 and 8x8 on one through to 64 processors. The measured execution times, which are tabulated in Appendix B.3.2, were used to calculate the codes' efficiency for particular combinations of problem size and number of processors. Graphs of efficiency versus the number of processors were plotted, and these are shown for the three problem sizes in Figure 6.13. As with the BSP-TUNED code, superlinear speedup was achieved for problem size 8x8 on one to thirty-two processors, with the upward leap in efficiency again happening at four processors. As argued for the BSP code, this improvement must have been caused by the increase in available cache from 16kb on two processors to 32kb on four processors.

Figure 6.13: Efficiency Graphs for the MPI Codes on the Cray T3D (efficiency against number of processors for problem sizes 4x4, 7x7 and 8x8, comparing MPI-FIRST and MPI-COMMS with the naive ideal).


To help compare the performance of BSP and MPI, a further set of graphs was plotted, comparing the efficiencies of the BSP-TUNED and MPI-COMMS codes. These graphs were plotted for problem sizes 4x4, 7x7 and 8x8, and are shown in Figure 6.14. MPI was clearly more efficient than BSP, and also scaled better. Interestingly, the shapes of the efficiency curves were strikingly similar for the two implementations.

6.2.5 Overview
The results for performance are most useful on the SGI Challenge and the Cray T3D, where they allow the different parallelisation methods to be compared. Here, they have demonstrated that both MPI-COMMS and, on the SGI Challenge, SM-PMON perform considerably better than BSP-TUNED. This is likely to be due to the early release of the WWBSP library which was used, rather than to any inherent problems with BSP itself. The difference in performance between MPI-COMMS and SM-PMON was too small to draw conclusions from.

6.3 Ease of Use


As stated in Section 2.4, ease of use is concerned with how quickly a parallelising method can achieve an efficient code. As suggested there, a diary of development activities was kept, with the contribution of each task to each version of the code being recorded. A summary of the diary is shown in Table 6.9. As stated in Section 2.4, the efficiency of each code was measured at intervals throughout the development cycle. These measurements have been plotted in a series of graphs, for the same problem sizes used in Section 6.2, namely 4x4, 7x7 and 8x8.

Figure 6.14: Comparative Efficiency Graphs for the Cray T3D (efficiency against number of processors for problem sizes 4x4, 7x7 and 8x8, comparing BSP-TUNED and MPI-COMMS with the naive ideal).


    Description of Task                          Number     Contribution to Versions
                                                 of Days    SM      BSP     MPI
    Examining Code                               10         10      10      10
    Gathering Serial Results                     10
    Devectorising Code                           5          5       5       5
    Parallelising dmdt Routine                   2          2
    Gathering Results                            5
    Parallelise Remaining Loops                  5          5
    Gathering Results                            5
    ELOG Analysis and Code Changes               1          1       1       1
    Gathering Results                            1
    PMON Analysis and Code Changes               15         15      15      15
    Gathering Results                            2
    Implementing Threads and Barriers Version    15                 15      15
    Duplicating Arrays (One Copy per Thread)     4                  4       4
    Switch to Local Numbering System             5                  5       5
    Making all Array Reads Local                 3                  3       3
    Making all Scalar Reads Local                3                  3       3
    Implementing BSP Version                     3                  3       3
    Gathering Results                            10
    Tuning BSP Library                           1                  1       1
    Gathering Results                            10
    Implementing Initial MPI Code                3                          3
    Gathering Results                            10
    Improving MPI Communications                 1                          1
    Gathering Results                            10
    Total Development Time                       139        38      65      69

    Table 6.9: Development Cycle Diary


The changing efficiency of a particular code on different numbers of processors and different machines is plotted on the same graph, so the notation `p<n>-<platform>' is used to label the lines, where n is the number of processors. In all the graphs, the number of days represents time spent developing code, ignoring time spent performing background tasks such as gathering results.

The first set of graphs, showing the efficiency of the shared memory code against its development time for problem sizes 4x4, 7x7 and 8x8, is given in Figure 6.15. The development time is measured from after the code had been devectorised, i.e. from day 26 of the diary in Table 6.9. The development of the threads and barriers code happened next, using the development method presented in Chapter 4. This code was only developed on the SGI, as this was the only one of the target platforms for the BSP and MPI codes which supports shared memory. The graphs for problem sizes 4x4, 7x7 and 8x8 are shown in Figure 6.16; the development time on these graphs starts from day 62 of the diary. The BSP version of the code was developed from the threads and barriers code, and the MPI version in turn from the BSP code. The graphs showing the change in efficiency during the development of these two versions can be found in Figure 6.17, for BSP, and Figure 6.18, for MPI. According to the development diary in Table 6.9, the development times shown on these graphs start from day 92 for BSP and day 116 for MPI.

As stated in Section 2.4, the intention was to see if any particular method had a shorter development time than another. Unfortunately, the measurements for BSP and MPI are specific to the development method. If the same problem was given to someone from the message-passing community, they would not have built a threads and barriers implementation from the serial code, then developed a BSP code, and finally an MPI code.

Figure 6.15: Efficiency of Shared Memory Code versus Development Time (problem sizes 4x4, 7x7 and 8x8; lines for p1 and p4 on the SGI and for p1, p4, p16 and p24 on the KSR1, with the naive ideal).

Figure 6.16: Efficiency of Threads and Barriers Code versus Development Time (problem sizes 4x4, 7x7 and 8x8; lines for p1 and p4 on the SGI, with the naive ideal).

Figure 6.17: Efficiency of BSP Code versus Development Time (problem sizes 4x4, 7x7 and 8x8; lines for p1 and p4 on the SGI and for p1, p4, p16, p32 and p64 on the T3D, with the naive ideal).

Figure 6.18: Efficiency of MPI Code versus Development Time (problem sizes 4x4, 7x7 and 8x8; lines for p1 and p4 on the SGI and for p1, p4, p16, p32 and p64 on the T3D, with the naive ideal).


This means that it would be unfair simply to add the time spent implementing the threads and barriers and BSP versions to the time spent writing the MPI code and declare this to be the development time for MPI. For this reason, it is impossible to say that the results given here support any conclusions. To be able to learn something from these results, it would be most useful to compare them with results from a similar exercise conducted by someone from a message-passing background. It is extremely unlikely that such a person would use an intermediate shared memory code to produce a message-passing code. It would also be interesting to see how they would produce a shared memory code, given that they would most likely construct the message-passing code first. This would allow the results to be checked for bias towards the background, and the research environment, of the developer. Issues concerning such differences will be discussed more fully when considering future experiments in Section 7.3. The use of the diary to collect information about shared tasks was an interesting exercise, and shows, if anything, that the same issues (e.g. data distribution) need to be considered when writing an efficient parallel code no matter what method is being used.

6.4 Summary
The codes which had been developed in the previous chapter were examined to help evaluate the parallelising methods used to produce them. The results from these evaluations, according to the criteria selected in Chapter 2, were presented. The maintainability criterion was dealt with first. Arguments were made as to the readability of the codes, and the software metrics proposed in Section 2.2 were used to back these points up, showing that the shared memory method produced the most readable code.


For development potential, an example code was presented. A simple modification to this code was proposed, and the complexity of the change was considered for each method of parallelisation. Again, the shared memory code was shown to be superior to both Bulk Synchronous Parallel (BSP) and message-passing. Portability, the final aspect of maintainability, was considered for the three final codes, and it was clear that the message-passing and BSP codes were more portable than the shared memory code. However, this result was shown to be a consequence of the lack of standardisation in the shared memory community, rather than a consequence of the method itself.

The performance results, which were explained in depth, showed no one method to be superior, with the shared memory and message-passing codes displaying similar performance. The inefficiency of the BSP code was possibly due to the early version of the library that was used. Overall, the performance results show no significant advantage (or disadvantage) to any particular method. The ease of use results were presented, but were considered to be too specific to the development method to support any conclusions. The results presented here will now be discussed in more depth, and their implications considered, in the final chapter of the thesis.

Chapter 7 Conclusions
The evaluations of the parallelising methods have now been completed, and their results have been presented. These results will now be considered and explained in more depth. Following this discussion, conclusions will be drawn as to the most suitable method for parallelising existing codes. A critique of the work will also be presented, suggesting how aspects of the work might have been approached differently. Following this, possible extensions to the work shall be explored, and further work will be discussed in general. The chapter ends with a summary.

7.1 Discussion of Results


The results for each of the evaluations proposed in Chapter 2 are discussed in turn. The maintainability results which were presented in Section 6.1 are considered first. In Chapter 2, maintainability was broken down into three key areas: readability, development potential, and portability. Considering readability first, as stated in Section 6.1.1, it is clear that the shared memory code is the most readable. It should be noted that this is at least in part due to the method used to program the shared memory code: a program written with threads and barriers would not yield such a readable code.


In terms of readability, distributed memory models will always do badly. The very nature of programming a distributed memory machine by current methods makes it a complex task: the programmer needs to keep in mind that several instances of a particular routine will be running, each with its own copy of the data items. By altering this single routine, the separate instances must be made to co-operate with each other and solve their own part of the overall problem. Compared with this, a shared memory code programmed using compiler directives, which state that the iterations of a loop are independent of each other and may be run on several processors, is far easier to understand. These observations were verified by the software metrics presented in Section 6.1.1. These metrics showed the shared memory code to be shorter than the BSP and MPI codes, and to have required fewer lines to be changed from the devectorised code. The BSP and MPI codes are almost bound to be longer than the shared memory code, as they contain lines of code to explicitly perform interprocessor communications, whereas the shared memory code has none.

Comparing BSP with MPI, there are points in each method's favour. In BSP, communication is one-sided, which makes the code more like the serial version than MPI with its two-sided communications. However, MPI handles messages in terms of datatypes, and supports abstract datatypes as well as the base types from which they may be constructed. The claim that the BSP library will combine communications when it is efficient to do so means that time does not have to be spent packing and unpacking data, as is necessary for an efficient MPI code (see Section 5.5). The BSP library also requires that communications be specified in terms of a base address, a byte offset into this area, and a number of bytes to be transferred. To address arrays of two dimensions or more in this way is difficult, requiring knowledge of the sizes of all but one of the array dimensions to address a single array element.
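The point about addressing can be made concrete by writing out the byte offset that such a put call must be given for a single element of a two-dimensional array. The sketch below assumes a double precision array declared as A(N,M) and stored in Fortran's column-major order; the function name byteoff is invented for illustration and is not part of any library.

c     Byte offset of element A(i,j) within a double precision
c     array declared as A(n,m): the leading dimension n must be
c     known to the communicating code.
      integer function byteoff (i, j, n)
      implicit none
      integer i, j, n
      byteoff = ((j - 1) * n + (i - 1)) * 8
      return
      end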


However, it is important to note that these disadvantages of using BSP, together with the lack of any higher-level routines (e.g. reduction calls), are all points about the user interface, whereas the points in favour of BSP, the one-sided messaging and the automatic combining of communications, are advantages due to its novel design. With the draft standard for MPI-2 containing specifications for one-sided messaging, the real difference between BSP and MPI will be reduced considerably by future implementations, leaving the automatic combining and scheduling of communications as the only key advantage of BSP; but this is still a significant difference. This means that, if future issues of the BSP library can realise the potential of the BSP model with a good user interface, then BSP codes should certainly become more readable than MPI codes. Again, this was shown to be the case by the metrics in Section 6.1.1, which showed that the BSP code required fewer new data structures than MPI. As new data structures are a major modification to the code (see Section 2.2), this suggests that BSP yields more readable codes than MPI.

As reasoned in Section 6.1.2, the shared memory code is the easiest to develop from the point of view of the non-expert. For all methods, minor changes can be made to existing loops of the code. With the shared memory code, changes can be made to the communication patterns of the code, but difficulties arise if new variables are required inside a loop, as the compiler must be directed as to which variables are local and which are shared. With BSP and MPI, modifications to loops may introduce new variables, as all variables are local, but, as communication patterns require expert knowledge to alter, any added computation must be restricted to locally available data. As was shown by an example in Section 6.1.2, the shared memory code is the only code to which completely new loops can be added no matter what data they access. As a new loop will not be parallelised, there will be a degradation in performance, but at least the code can be changed.
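For illustration, the kind of annotation involved might look roughly like the sketch below, written in the style of the SGI loop-parallelising directives; the exact directive spelling differs between the SGI and KSR compilers, and the subroutine, loop body and variable names here are invented, so the fragment is only indicative rather than taken from the real code.

      subroutine scale_sum (n, x, y, z)
      implicit none
      integer n, i
      double precision x(n), y(n), z(n), tmp
c     The directive asserts that the iterations are independent
c     and names which variables are private to each iteration
c     (local) and which are shared between processors.
c$doacross local (i, tmp), share (n, x, y, z)
      do 20 i = 1, n
         tmp  = x(i) + y(i)
         z(i) = tmp * 0.5d0
   20 continue
      return
      end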


A disadvantage of the message-passing method is that the last stage of developing a message-passing code is to combine communications to other processors. This explicit packing and unpacking makes the communication patterns of the code more fixed. Indeed, if significant modifications are to be made to the code, it may be that the version of the code without packing would be a more useful place to start. As stated earlier, in BSP the combining of communications is performed by the library, so the source code does not have to be altered in this manner.

As reasoned in Section 6.1.2, it will often be desirable to retain a serial version of the code for use on a workstation, either for running small problems, or for the non-parallelisation-expert to experiment with and develop. As noted, the shared memory code is simply an annotated serial code. This makes maintenance simpler in the case where a serial version of the code is still required, as there will still be only one version of the code. This contrasts with BSP and MPI, where, as argued, the only practical way to do this is to have two versions of the code. In this case, to deal with modifications to the serial code, the parallelisation expert is required either to make the same modifications to the parallel code, or to reparallelise the altered serial code from scratch. However, the results presented in Section 6.3 show the development method presented in Chapter 4 to be too long to make re-development of the code a viable option. This means that either the expert will have to make the same modifications to the parallel code, or the incremental development method must be shortened. One possible way to shorten the development method would be to automate it, either fully or partially. This fascinating prospect will be discussed fully in Section 7.3.

The portability of the BSP and MPI codes is high: both support a wide variety of platforms, including networks of workstations.


There are also freely available implementations of MPI, designed specifically for networks of machines, that support heterogeneous networks. In the case of shared memory, the lack of a standard set of parallelising directives means that a new set of directives must be added to the code for it to execute in parallel on a new shared memory platform. Even then, the shared memory code is limited to running on machines which provide a shared memory system; this is a small subset of the systems to which MPI and BSP codes may be ported. However, as pointed out in Section 6.1.3, the lack of portability provided by the shared memory method is due to a lack of standardisation; there are no features of the shared memory method which mean that it could not become portable in principle. As also stated in Section 6.1.3, it is unlikely that the porting of a code will be as frequent an action as either a bug fix or a code alteration, and so portability cannot be considered as important as either the readability or the development potential of a code. This means that, overall, programming in shared memory with parallelising compiler directives gives the most maintainable code. Currently, MPI gives more maintainable code than BSP, due to the large readability problems with the current user interface to the BSP library, but this position should be reversed by subsequent issues of the BSP library.

Turning now to the performance results presented in Section 6.2, the most noticeable result from the comparative efficiency graphs presented in Figures 6.10 and 6.14 is that the MPI and shared memory codes perform much better than the BSP code. The BSP code's poor performance may have been largely due to the BSP library, which was still a beta release. It would have been possible to experiment with the BSP code to try to improve performance, perhaps by combining communications as in the MPI code. However, the BSP library claims to combine messages automatically, as well as to avoid contention by scheduling communications.


However, as noted in Section 5.4, no alterations to the library settings for contention resolution or communication combining provided increased performance. Also, one of the key novel ideas of BSP, as presented in [Valiant90], is that communications can be scheduled so that they overlap periods of local computation. To combine communications in the user's code would be to restrict the library's choices as to when the communications could be performed, and could potentially damage performance. Another attractive aspect of BSP is that direct communications can be performed without the use of buffering, which is precisely what combining communications would introduce. It is not possible to determine why the BSP library was so inefficient, but with subsequent releases the efficiency of the BSP code should increase.

Comparing the MPI and shared memory codes' performances on the SGI Challenge provides two key observations:

1. the MPI code is more efficient than the shared memory code for most problem sizes and numbers of processors; and

2. the shared memory code performs consistently for all problem sizes and numbers of processors, outperforming the MPI code on certain problem size and processor combinations.

As stated in Section 6.2.2, where the SM-PMON code performs better than the MPI-COMMS code, this was due to a poor division of work onto the processors. The design decision for the data distribution was taken when developing the threads and barriers implementation (described in Chapter 4), in order to keep the code's communication patterns simpler. This is a good trade-off to make, as the performance of the code is only impaired for a small number of cases. To distribute the data differently would require new places in the code where communication would be required, and would be likely to impair performance for the problem sizes and numbers of processors for which the code is currently most efficient.


For the shared memory code, iterations of loops are distributed in some fashion, and the communications are dealt with transparently by the shared memory system. This means that, even if a certain distribution of iterations results in complicated communication patterns, the programmer need not be concerned, or even know about it. This is an important result, as it means that the shared memory model can cope with complex communication patterns better than distributed models which require explicit communications.

It seems unlikely that, on a true shared memory machine, the MPI-COMMS code, which effectively has an extra software communications layer, should outperform the SM-PMON code, which does not. There are two possible explanations for this behaviour:

1. the shared memory system is slower because more communications are being performed than are necessary, or

2. the MPI code is faster because it is more loosely synchronised than the shared memory code.

The latter is an inevitable consequence of using directives to parallelise loops: the processors must synchronise at the end of a parallel loop before the serial code following the loop can be executed. The most likely opportunity for excessive communications in the shared memory code is the use of shared arrays which all processes needed to write to and read from. As mentioned in Chapter 4, it would be possible to observe some false sharing in these circumstances. As the MPI code has a distinct copy of every array, with data being duplicated where it is required by more than one processor, this possibility is eliminated. It would be possible to alter the shared memory code so that data was duplicated for each processor, which should produce a performance improvement. However, this would involve writing code which took different actions on different processors, which cannot be achieved by simply adding compiler directives, and so is outside the scope of this evaluation.
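The kind of duplication used in the MPI code can be sketched as follows: rather than one shared array written by every thread, each thread works on its own copy, modelled here by adding the thread number as an extra array dimension. The subroutine, array and variable names are invented for illustration, and this is not how either of the real codes was written.

      subroutine zero_local (ng, nthread, me, field)
      implicit none
c     Each thread passes its own identifier 'me' and touches only
c     column 'me' of the duplicated array, so writes from
c     different threads land in different columns, largely
c     avoiding false sharing of cache lines.
      integer ng, nthread, me, i
      double precision field(ng, nthread)
      do 30 i = 1, ng
         field(i, me) = 0.0d0
   30 continue
      return
      end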


The duplication of data results in more memory being required by the program, so the improved performance is an example of trading temporal overheads against spatial overheads. This trade-off was especially attractive, as the arrays which were duplicated were O(N), and were far smaller than the arrays used to hold the lists of grains and level-one cells which each grain uses in the dipole computation; little impact was therefore made on the memory usage of the code. Unfortunately, as stated in Section 6.3, the ease of use results supported no conclusions.

A final difference between the distributed memory codes and the shared memory code was that, due to the static nature of data in Fortran, the number of processors had to be fixed at compile time, as this number determined the size of the distributed arrays for each process. The requirement for every module of the code to be recompiled when the number of processors is altered can be a large overhead, with compilation times on the Cray T3D being up to 5 minutes for a fully optimised executable.

To summarise, it is concluded that the shared memory method is the most appropriate method for parallelising legacy codes. It provides more readable codes which are easier for the non-expert to maintain than either message-passing or BSP, which more than offsets the limited portability that the method currently provides. The performance figures do not show any one method to be more favourable than another, and so provide no further conclusions.

7.2 Evaluation of Work


There are several areas of the thesis where further work would have provided more satisfactory results. For various reasons, mainly time constraints, these areas could not be dealt with thoroughly.


As a whole, the evaluation performed in this thesis suffered from being carried out by just one person, and on just one code. A larger study would allow more conclusive results, and is discussed in Section 7.3.

A more in-depth analysis of which parallelisation methods are better for which problem sizes and numbers of processors would have been useful. To investigate this properly would require analysis of communication volumes and of the amount of computation required for the different problem sizes. This would allow more general conclusions about which method would yield the most efficient solution for a given problem.

The development method presented in Chapter 4 is restrictive, as it requires the use of a shared memory machine in order to develop codes for distributed memory machines. This issue is addressed more fully in Section 7.3.

In Section 5.4, the rationale for using a particular data distribution is given. This data distribution is later given as the reason for the MPI code showing poorer performance than the shared memory code for certain problem sizes and numbers of processors. It would have been interesting to implement a new MPI code using exactly the same data distribution as the shared memory code, to see if MPI could outperform the shared memory code with this new, more complex communication pattern. It would also be interesting to see whether this new code performed as well as the current MPI code for the problem sizes where the MPI code was most efficient.

As stated in Section 6.2.1, in the default configuration the code will only run successfully for a limited problem size, due to exchange effects. Ideally, the setup for the code would have been altered to allow tests for much larger problem sizes (e.g. 15x15, 20x20) to be performed. This would have provided important information as to how the code scales with problem size, and would also have allowed the code to be run on more processors on the T3D, thus providing important information about performance for large numbers of processors (i.e. 128, 256 and 512).


Finally, more frequent measurements of efficiency for the development diary which was used in Section 6.3 would have allowed more to be said about the evolutionary nature of the shared memory code. As stated in Section 7.1, the shared memory code can be run at any stage in its development. With frequent measurement, e.g. once an hour, this would yield a smooth curve for shared memory, whereas BSP and MPI would still produce sharp lines, thus providing graphical support for the argument about the benefits of the evolutionary approach of programming using parallelising compiler directives.

7.3 Extensions and Further Work


Some areas where the work presented in this thesis could be extended were identified. These are presented along with ideas for separate, larger works based upon the conclusions of this thesis.

In terms of extending the work contained in the thesis, an evaluation of the performance of the codes produced by the three methods on a larger SGI Challenge machine would be useful. This would provide valuable information as to whether the MPI or the shared memory code scales better, which currently can only be estimated by extrapolation of the results.

A version of the code for the T3D which used SHMEM calls for communications could easily be developed from the BSP code. Simply replacing `bspput' calls with `shmem_put' and `bspsync' with `barrier' would generate a code which would be as efficient as a BSP library configured neither to combine communications nor to perform contention resolution could be. This code could be used to evaluate both current and future releases of the BSP library, and might provide important information as to why the BSP code should perform so badly.
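As an illustration, the substitution being suggested might look like the sketch below. This is only a sketch: the routine name `putgrain', the parameter `ngrmax' and the common block `/moments/' are invented for the example, and the SHMEM argument order (target, source, length in 64-bit words, remote processing element) is assumed from Cray's documentation rather than taken from this thesis.

      subroutine putgrain(igr, iproc)
c     sketch only: dmx is kept in a common block so that it is
c     remotely accessible on the T3D; ngrmax is a hypothetical bound
      integer ngrmax
      parameter (ngrmax=4096)
      real*8 dmx(0:ngrmax)
      common /moments/ dmx
      integer igr, iproc
c     BSP form, as used in Appendix A.4:
c       call bsphpput(iproc-1, dmx(igr), dmx, dblsiz*igr, dblsiz)
c     direct SHMEM form (argument order assumed): write one 64-bit
c     word into element igr of dmx on processing element iproc-1
      call shmem_put(dmx(igr), dmx(igr), 1, iproc-1)
      return
      end

Wherever the BSP code calls `bspsync', the SHMEM code would simply call `barrier', as suggested above.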


As stated in Section 5.3, a precise numerical overhead analysis was not appropriate for simply tuning the shared memory code. However, a complete overhead analysis of all three versions would have been useful. To perform such an analysis would have required extra time to gather results, as well as the time required to port both the MPICH and BSP libraries to the KSR1, which was the only available platform with the required support for such a detailed analysis. Such an analysis would provide an accurate evaluation of the three codes, and would allow more accurate predictions to be made about the suitability of each method to a particular problem than the analysis of communication and computation volumes suggested in Section 7.2 could permit.

The first area identified for future work would be to apply the evaluation criteria detailed in Chapter 2 to more parallel programming methods, and perhaps to some different methods from high performance computing. Evaluations of parallel programming methods applied to new codes rather than legacy codes would also be a valuable project.

A further study of how easy maintenance is for codes produced by various parallel programming methods would be possible. Areas of interest would be the support provided for debugging codes written using the different methods, as well as measuring how long certain classes of code modification take to implement. This would provide more quantitative results about the maintainability of codes produced by various methods.

A larger study of the kind presented in this thesis would be of benefit, as it would provide more concrete results. Ideally, the study would be performed by many people, with groups of, say, five people evaluating a particular parallelising method for a particular code. Several different codes could be evaluated, with no group of people working on more than one code or more than one method.


Unfortunately, this requires the co-operation of a large number of people for a long duration of time: here, the shortest development time for a parallel code, starting from the devectorised code, was seven days. Simpler codes could be used, but these must not be so simple that parallelising them is trivial. Harder to decide is whether to use complete novices to parallel computing, or people with a limited knowledge of the subject. The problem with complete novices is that the learning curve will lengthen the experiment considerably. However, using non-novices presents issues with differing backgrounds and experiences influencing the decisions they make; an example of this was cited in Section 6.3, which posited the different approach that a member of the message-passing community would have taken to the experiment performed in this thesis. This reflects just one of the many problems facing the programming psychology community. For a larger study to take place, not only would the differing behaviour of people from the same environment need to be accounted for, but also the possibility that people from different environments might provide completely different results.

As stated in Section 6.1.3, the current poor portability of shared memory codes is due to a lack of standardisation. A useful task would be for an institution or working party to devise and propose a draft of just such a standard. This would be a major breakthrough for the shared memory community. In the interim, work on the automatic transformation of parallelising directives between the various platforms, as also mentioned in Section 6.1.3, would be a valuable task.

Finally, more work on the evolutionary development method for moving to distributed memory codes, presented in Chapter 4, would be useful. As proposed in Section 7.1, a most interesting way in which the development method could be extended would be to automate the process. Much of the first part of the method, which concerns producing a threads and barriers implementation, is repetitive work requiring little expert knowledge, e.g. following the decision tree in Figure 4.4.


Here, there is clearly scope for automation. Partial automation would appear to be the best way forward, with the user prompted for the required expert knowledge, such as the likelihood of a particular subroutine providing an opportunity for parallelism.

An important test of an automated version of the method would be its ability to cope with small changes to the input program: ideally, the parallelising tool should not have to ask the user all of the same questions as before. To achieve this, the expert knowledge might be stored in auxiliary files, which could aid the tool when constructing its internal representation of the program. In the example of the suitability of particular routines for parallelisation, this would mean that the user would not be required to re-enter the data for routines which had not changed. Given a suitable method of encoding the expert knowledge, the parallelisation tool should be able to re-parallelise a code which has been modified in a minor way quickly, and with little user interaction. Ideally, some minor modifications could be dealt with by the tool without any expert input. To take this idea to its logical conclusion, it might even be possible to use the expert knowledge to build up a knowledge base. Ideally, this would allow the tool to use case-based reasoning [Aamodt94] to help parallelise new, previously unseen programs.

Another important extension of the development method would be to modify the method so that a shared memory machine is not required. This might involve the development of a library providing a bare simulation of an environment which supports multiple instances of a program, with a barrier synchronisation call and a simple global lock mechanism. To be truly useful, the library would have to be as portable as possible, ideally as portable as MPI or BSP.
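A minimal sketch of the interface such a library might present is given below. It is shown as a degenerate implementation in which only one instance of the program exists, so that the barrier and the global lock collapse to no-ops; all of the routine names are hypothetical and are not taken from any existing library.

      subroutine sim_init(myid, nprocs)
c     hypothetical interface: report a single instance, numbered zero
      integer myid, nprocs
      myid = 0
      nprocs = 1
      return
      end

      subroutine sim_barrier
c     with only one instance there is nothing to wait for
      return
      end

      subroutine sim_lock
c     the single instance always holds the (only) global lock
      return
      end

      subroutine sim_unlock
      return
      end

A genuinely multi-instance version would replace these bodies, for example with calls to a threads package on machines that provide one, while keeping the same entry points, so that the first part of the development method could be followed on an ordinary workstation.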


7.4 Summary
The results as presented in Chapter 6 were discussed, and it was concluded that the shared memory method was the most suitable for parallelising legacy codes. Although the message passing code was slightly more efficient in most cases, the shared memory code was seen to be more readable, easier for the non-expert to modify, and so easier to maintain in general. This is an important result, as there have been no studies of this kind in the past.

An evaluation of the work presented in the thesis was provided, and areas where a more thorough treatment would have been preferable were identified. Finally, extensions to the work contained in the thesis were proposed, and possible new projects related to the work were discussed. Of these, the most interesting was the possibility of automating the incremental development method presented in Chapter 4. This novel method is important as it provides an evolutionary way of developing distributed memory model codes from serial codes, a task which has previously relied on revolutionary methods.

Appendix A Subroutine `dmdt'


The following sections contain versions of the subroutine `dmdt'. As stated in Section 5.2, this routine was responsible for around 80% of execution times. The original version is shown first, followed by the version resulting from the `devectorising' process described in Section 5.2. Lastly, the three final versions of the routine are given.

It should be noted that, as the Cray T3D and KSR1 are true 64-bit architectures, double precision floating point numbers would actually mean 128-bit numbers (or REAL*16). In the case of the KSR1, 128-bit floating point numbers are implemented partly in software, and so are slow, and on the Cray T3D they are not implemented at all. In order to compile these codes to use single precision floating point numbers (REAL*8) without changing the code, the compiler flag `-r8' was used on the KSR1, and `-dp' on the Cray T3D.

A.1 Original Version


subroutine dmdt(dtime,dmthp,dotm) include 'cluster.h' dimension dmthp(ngr,2) dimension dotm(ngr,2) c arrays for mx, my, mz for the individual grains c n.b. grain number 0 is a fictional grain with no moment, to simplify c the use of incomplete neighbour lists. dimension dmx(0:ngr),dmy(0:ngr),dmz(0:ngr) c moments of the level 1 and 2 elements dimension dm1x(nl1),dm1y(nl1),dm1z(nl1) dimension dm2x(nl2),dm2y(nl2),dm2z(nl2) c initialise fictional grains dmx(0)=0. dmy(0)=0. dmz(0)=0. c calculate x,y,z components of m *vocl loop,temp(drsin) do 5 igr=1,ngr drsin=dble(rmsat)*sin(dmthp(igr,1)) dmy(igr)=dble(rmsat)*cos(dmthp(igr,1)) dmx(igr)=drsin*cos(dmthp(igr,2))



dmz(igr)=drsin*sin(dmthp(igr,2)) 5 continue c calculate the total moments for the grid elements c start with level 1, using the element member list igrlv1 do 10 il1=1,nl1 dm1x(il1)=0. dm1y(il1)=0. dm1z(il1)=0. 10 continue do 20 igr=1,ngrl1 do 20 il1=1,nl1 dm1x(il1)=dm1x(il1)+dmx(igrlv1(il1,igr))*v(igrlv1(il1,igr)) dm1y(il1)=dm1y(il1)+dmy(igrlv1(il1,igr))*v(igrlv1(il1,igr)) dm1z(il1)=dm1z(il1)+dmz(igrlv1(il1,igr))*v(igrlv1(il1,igr)) 20 continue c then do level 2 as a sum of level 1 elements do 25 il2=1,nl2 dm2x(il2)=0. dm2y(il2)=0. dm2z(il2)=0. 25 continue c the array l2bll1 contains the element number of the level 1 element in c the lower left corner of the level 2 element. for all the nl1l2*nl1l2 c level 1 elements in each level 2 element, calculate the offset of the c element number from this bottom left element. do 35 il1z=0,nl1l2-1 il1ofz=il1z*nl1x do 35 il1x=0,nl1l2-1 il1ofs=il1ofz+il1x c then sum the contributions to the level 2 elements do 30 il2=1,nl2 dm2x(il2)=dm2x(il2)+dm1x(l2bll1(il2)+il1ofs) dm2y(il2)=dm2y(il2)+dm1y(l2bll1(il2)+il1ofs) dm2z(il2)=dm2z(il2)+dm1z(l2bll1(il2)+il1ofs) 30 continue 35 continue c calculate total field ht c add together external field, shape and crystal anisotropy (if present) do 100 igr=1,ngr dhxt(igr)=dble(hxext(igr)) + +dble(rd11(igr))*dmx(igr) + +dble(rd12(igr))*dmy(igr) + +dble(rd13(igr))*dmz(igr) dhyt(igr)=dble(hyext(igr)) + +dble(rd21(igr))*dmx(igr) + +dble(rd22(igr))*dmy(igr) + +dble(rd23(igr))*dmz(igr) dhzt(igr)=dble(hzext(igr)) + +dble(rd31(igr))*dmx(igr) + +dble(rd32(igr))*dmy(igr) + +dble(rd33(igr))*dmz(igr) 100 continue c effective field due to k2 if(qk2inc)then *vocl loop,temp(dspmk,dhmag) do 200 igr=1,ngr dspmk=rkx(igr)*dmx(igr)+rky(igr)*dmy(igr)+rkz(igr)*dmz(igr) dhmag=1.0-dspmk*dspmk*dk2f1 dhmag=dhmag*dspmk*dk2f2 dhxt(igr)=dhxt(igr)+dhmag*rkx(igr) dhyt(igr)=dhyt(igr)+dhmag*rky(igr) dhzt(igr)=dhzt(igr)+dhmag*rkz(igr) 200 continue endif c dipole interaction contribution :if(qdipol)then



c firstly - exact calculation of neighb nearest neighbours. do 500 jgr=1,neighb *vocl loop,temp(dspij,dmxij,dmyij,dmzij) do 400 igr=1,ngr dmxij=dmx(nliste(igr,jgr)) dmyij=dmy(nliste(igr,jgr)) dmzij=dmz(nliste(igr,jgr)) c scalar product of m and rij. n.b. yij = 0.0, so don't bother with my dspij=dmxij*dble(xij(igr,jgr)) + +dmyij*dble(yij(igr,jgr)) + +dmzij*dble(zij(igr,jgr)) c note that fij is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used. dhxt(igr)=dhxt(igr) + +dble(fij(igr,jgr))*(dspij*dble(xij(igr,jgr))-dmxij) dhyt(igr)=dhyt(igr) + +dble(fij(igr,jgr))*(dspij*dble(yij(igr,jgr))-dmyij) dhzt(igr)=dhzt(igr) + +dble(fij(igr,jgr))*(dspij*dble(zij(igr,jgr))-dmzij) 400 continue 500 continue c secondly - nearest level 1 elements. do 1500 jl1=1,nl1dim *vocl loop,temp(dspij,dmxij,dmyij,dmzij) do 1400 igr=1,ngr dmxij=dm1x(nlist1(igr,jl1)) dmyij=dm1y(nlist1(igr,jl1)) dmzij=dm1z(nlist1(igr,jl1)) dspij=dmxij*dble(xij1(igr,jl1)) + +dmyij*dble(yij1(igr,jl1)) + +dmzij*dble(zij1(igr,jl1)) c note that fij1 is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used. dhxt(igr)=dhxt(igr) + +dble(fij1(igr,jl1))*(dspij*dble(xij1(igr,jl1))-dmxij) dhyt(igr)=dhyt(igr) + +dble(fij1(igr,jl1))*(dspij*dble(yij1(igr,jl1))-dmyij) dhzt(igr)=dhzt(igr) + +dble(fij1(igr,jl1))*(dspij*dble(zij1(igr,jl1))-dmzij) 1400 continue 1500 continue c finally the remainder of the film, using level 2 elements. c since 75% of the film is covered this way, do not select the c contributing components, just sum over all level 2 elements for best c vectorization rate. the elements which should not contribute have been c taken care of in extfac, by setting fij = xij = zij = 0.0 do 2500 jl2=1,nl2 *vocl loop,temp(dspij) do 2400 igr=1,ngr dspij=dm2x(jl2)*dble(xij2(igr,jl2)) + +dm2y(jl2)*dble(yij2(igr,jl2)) + +dm2z(jl2)*dble(zij2(igr,jl2)) c note that fij2 is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used. dhxt(igr)=dhxt(igr) + +dble(fij2(igr,jl2)) * *(dspij*dble(xij2(igr,jl2))-dm2x(jl2)) dhyt(igr)=dhyt(igr) + +dble(fij2(igr,jl2)) * *(dspij*dble(yij2(igr,jl2))-dm2y(jl2)) dhzt(igr)=dhzt(igr) + +dble(fij2(igr,jl2)) * *(dspij*dble(zij2(igr,jl2))-dm2z(jl2))



2400 continue 2500 continue endif c exchange contribution. if(qexchg)then do 4000 jnb=1,nexgnb do 4000 igr=1,ngr dhxt(igr)=dhxt(igr)+dmx(nlistx(igr,jnb))*dble(exmult(igr,jnb)) dhyt(igr)=dhyt(igr)+dmy(nlistx(igr,jnb))*dble(exmult(igr,jnb)) dhzt(igr)=dhzt(igr)+dmz(nlistx(igr,jnb))*dble(exmult(igr,jnb)) 4000 continue endif c calculate dm/dt from landau - lifschitz - gilbert eqn. *vocl loop,temp(dsth,dcth,dsph,dcph,dhth,dhph) do 5000 igr=1,ngr dsth=sin(dmthp(igr,1)) dcth=cos(dmthp(igr,1)) dsph=sin(dmthp(igr,2)) dcph=cos(dmthp(igr,2)) dhth=dhxt(igr)*dcth*dcph+dhzt(igr)*dcth*dsph-dhyt(igr)*dsth dhph=dhzt(igr)*dcph-dhxt(igr)*dsph dotm(igr,1)=dgfct2*dhth-dgfct1*dhph dotm(igr,2)=(dgfct2*dhph+dgfct1*dhth)/dsth 5000 continue return end


A.2 Devectorised Version


subroutine dmdt(ngrcopy,dtime,dmthp,dotm) #include "cluster.h" dimension dmthp(ngr,2) dimension dotm(ngr,2) c arrays for mx, my, mz for the individual grains c n.b. grain number 0 is a fictional grain with no moment, to simplify c the use of incomplete neighbour lists. dimension dmx(0:ngr),dmy(0:ngr),dmz(0:ngr) c moments of the level 1 and 2 elements dimension dm1x(nl1),dm1y(nl1),dm1z(nl1) dimension dm2x(jnl2),dm2y(jnl2),dm2z(jnl2) c increment counter nnumit=nnumit+1 c initialise fictional grains dmx(0)=0. dmy(0)=0. dmz(0)=0. c calculate x,y,z components of m do 200 igr=1,ngr drsin=dble(rmsat)*sin(dmthp(igr,1)) dmy(igr)=dble(rmsat)*cos(dmthp(igr,1)) dmx(igr)=drsin*cos(dmthp(igr,2)) dmz(igr)=drsin*sin(dmthp(igr,2)) 200 continue c calculate the total moments for the grid elements c start with level 1, using the element member list igrlv1 c only required for dipole calculations #ifdef CL_USE_DIPOLE c if(qdipol)then do 20 il1=1,nl1 dtmpx=0



dtmpy=0 dtmpz=0 do 10 igr=1,ngrl1 dtmpx=dtmpx+dmx(igrlv1(il1,igr))*v(igrlv1(il1,igr)) dtmpy=dtmpy+dmy(igrlv1(il1,igr))*v(igrlv1(il1,igr)) dtmpz=dtmpz+dmz(igrlv1(il1,igr))*v(igrlv1(il1,igr)) 10 continue dm1x(il1)=dtmpx dm1y(il1)=dtmpy dm1z(il1)=dtmpz 20 continue c then do level 2 as a sum of level 1 elements c the array l2bll1 contains the element number of the level 1 element in c the lower left corner of the level 2 element. for all the nl1l2*nl1l2 c level 1 elements in each level 2 element, calculate the offset of the c element number from this bottom left element. do 30 il2=1,nl2 dtmpx=0 dtmpy=0 dtmpz=0 do 35 il1z=0,nl1l2-1 il1ofz=il1z*nl1x do 35 il1x=0,nl1l2-1 il1ofs=il1ofz+il1x c then sum the contributions to the level 2 elements dtmpx=dtmpx+dm1x(l2bll1(il2)+il1ofs) dtmpy=dtmpy+dm1y(l2bll1(il2)+il1ofs) dtmpz=dtmpz+dm1z(l2bll1(il2)+il1ofs) 35 continue dm2x(il2)=dtmpx dm2y(il2)=dtmpy dm2z(il2)=dtmpz 30 continue c endif #endif do 5000 igr=1,ngr c calculate total field ht c add together external field, shape and crystal anisotropy (if present) dtmpx=dble(hxext(igr)) + +dble(rd11(igr))*dmx(igr) + +dble(rd12(igr))*dmy(igr) + +dble(rd13(igr))*dmz(igr) dtmpy=dble(hyext(igr)) + +dble(rd21(igr))*dmx(igr) + +dble(rd22(igr))*dmy(igr) + +dble(rd23(igr))*dmz(igr) dtmpz=dble(hzext(igr)) + +dble(rd31(igr))*dmx(igr) + +dble(rd32(igr))*dmy(igr) + +dble(rd33(igr))*dmz(igr) c effective field due to k2 #ifdef CL_USE_QK2 c if(qk2inc)then dspmk=rkx(igr)*dmx(igr) + +rky(igr)*dmy(igr) + +rkz(igr)*dmz(igr) dhmag=1.0-dspmk*dspmk*dk2f1 dhmag=dhmag*dspmk*dk2f2 dtmpx=dtmpx+dhmag*rkx(igr) dtmpy=dtmpy+dhmag*rky(igr) dtmpz=dtmpz+dhmag*rkz(igr) c endif #endif c exchange contribution. #ifdef CL_USE_EXCHANGE c if(qexchg)then



do 3000 jnb=1,nexgnb dtmpx=dtmpx+dmx(nlistx(jnb,igr))*dble(exmult(jnb,igr)) dtmpy=dtmpy+dmy(nlistx(jnb,igr))*dble(exmult(jnb,igr)) dtmpz=dtmpz+dmz(nlistx(jnb,igr))*dble(exmult(jnb,igr)) continue endif


3000 c #endif c dipole interaction contribution :c firstly - exact calculation of neighb nearest neighbours. #ifdef CL_USE_DIPOLE c if(qdipol)then do 400 jgr=1,neighb dmxij=dmx(nliste(jgr,igr)) dmyij=dmy(nliste(jgr,igr)) dmzij=dmz(nliste(jgr,igr)) c scalar product of m and rij. n.b. yij = 0.0, so don't bother with my dspij=dmxij*dble(xij(jgr,igr)) + +dmyij*dble(yij(jgr,igr)) + +dmzij*dble(zij(jgr,igr)) c note that fij is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used. dtmpx=dtmpx+ + +dble(fij(jgr,igr)) + *(dspij*dble(xij(jgr,igr))-dmxij) dtmpy=dtmpy+ + +dble(fij(jgr,igr)) + *(dspij*dble(yij(jgr,igr))-dmyij) dtmpz=dtmpz+ + +dble(fij(jgr,igr)) + *(dspij*dble(zij(jgr,igr))-dmzij) 400 continue c secondly - nearest level 1 elements. do 1400 jl1=1,nl1dim dmxij=dm1x(nlist1(jl1,igr)) dmyij=dm1y(nlist1(jl1,igr)) dmzij=dm1z(nlist1(jl1,igr)) dspij=dmxij*dble(xij1(jl1,igr)) + +dmyij*dble(yij1(jl1,igr)) + +dmzij*dble(zij1(jl1,igr)) c note that fij1 is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used. dtmpx=dtmpx + +dble(fij1(jl1,igr)) + *(dspij*dble(xij1(jl1,igr))-dmxij) dtmpy=dtmpy + +dble(fij1(jl1,igr)) + *(dspij*dble(yij1(jl1,igr))-dmyij) dtmpz=dtmpz + +dble(fij1(jl1,igr)) + *(dspij*dble(zij1(jl1,igr))-dmzij) 1400 continue c dipole interaction contribution :c finally the remainder of the film, using level 2 elements. c since 75% of the film is covered this way, do not select the c contributing components, just sum over all level 2 elements for best c vectorization rate. the elements which should not contribute have been c taken care of in extfac, by setting fij = xij = zij = 0.0 do 2400 jl2=1,nl2 dspij=dm2x(jl2)*dble(xij2(jl2,igr)) + +dm2y(jl2)*dble(yij2(jl2,igr)) + +dm2z(jl2)*dble(zij2(jl2,igr)) c note that fij2 is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used.



+ * + * + * dtmpx=dtmpx +dble(fij2(jl2,igr)) *(dspij*dble(xij2(jl2,igr))-dm2x(jl2)) dtmpy=dtmpy +dble(fij2(jl2,igr)) *(dspij*dble(yij2(jl2,igr))-dm2y(jl2)) dtmpz=dtmpz +dble(fij2(jl2,igr)) *(dspij*dble(zij2(jl2,igr))-dm2z(jl2)) continue endif


2400 c #endif c calculate dm/dt from landau - lifschitz - gilbert eqn. dsth=sin(dmthp(igr,1)) dcth=cos(dmthp(igr,1)) dsph=sin(dmthp(igr,2)) dcph=cos(dmthp(igr,2)) dhth=dtmpx*dcth*dcph+dtmpz*dcth*dsph-dtmpy*dsth dhph=dtmpz*dcph-dtmpx*dsph dotm(igr,1)=dgfct2*dhth-dgfct1*dhph dotm(igr,2)=(dgfct2*dhph+dgfct1*dhth)/dsth dhxt(igr)=dtmpx dhyt(igr)=dtmpy dhzt(igr)=dtmpz 5000 continue return end

A.3 Shared Memory Version


subroutine dmdt(ngrcopy,dtime,dmthp,dotm) #include "cluster_system.h" #include "cluster.h" dimension dmthp(ngr,2) dimension dotm(ngr,2) c arrays for mx, my, mz for the individual grains c n.b. grain number 0 is a fictional grain with no moment, to simplify c the use of incomplete neighbour lists. c dimension dmx(0:ngr),dmy(0:ngr),dmz(0:ngr) c moments of the level 1 and 2 elements c dimension dm1x(nl1),dm1y(nl1),dm1z(nl1) c dimension dm2x(jnl2),dm2y(jnl2),dm2z(jnl2) integer itile,nthreads external hrton,hrtoff,number_of_threads #ifdef CL_OS_KSR_OS_PMON external pmonon, pmonoff #endif #ifdef CL_OS_KSR_OS_ELOG external elog_log #endif c do timer stuff call hrton(3) #ifdef CL_OS_KSR_OS_PMON call pmonon(3) #endif #ifdef CL_OS_KSR_OS_ELOG call elog_log(100,0) #endif c increment counter nnumit=nnumit+1



c initialise fictional grains dmx(0)=0. dmy(0)=0. dmz(0)=0. c calculate x,y,z components of m call hrton(9) C$DOACROSS LOCAL(igr,drsin),SHARE(rmsat,dmthp,dmx,dmy,dmz) #ifdef CL_OS_KSR_OS_PAR nthreads=number_of_threads() itile=(ngr+nthreads-1)/nthreads c*ksr* tile (igr,strategy=slice,tilesize=(igr:itile), c*ksr*& teamid=nmaint, c*ksr*& private=(igr,drsin)) #endif do 200 igr=1,ngr drsin=dble(rmsat)*sin(dmthp(igr,1)) dmy(igr)=dble(rmsat)*cos(dmthp(igr,1)) dmx(igr)=drsin*cos(dmthp(igr,2)) dmz(igr)=drsin*sin(dmthp(igr,2)) 200 continue #ifdef CL_OS_KSR_OS_PAR c*ksr* end tile #endif call hrtoff(9) c calculate the total moments for the grid elements c start with level 1, using the element member list igrlv1 c only required for dipole calculations #ifdef CL_USE_DIPOLE c if(qdipol)then call hrton(9) C$DOACROSS LOCAL(il1,igr,dtmpx,dtmpy,dtmpz), C$& SHARE(dmx,dmy,dmz,dm1x,dm1y,dm1z,igrlv1) #ifdef CL_OS_KSR_OS_PAR nthreads=number_of_threads() itile=(nl1+nthreads-1)/nthreads c*ksr* tile (il1,strategy=slice,tilesize=(il1:itile), c*ksr*& teamid=nmaint, c*ksr*& private=(il1,igr,dtmpx,dtmpy,dtmpz)) #endif do 20 il1=1,nl1 dtmpx=0 dtmpy=0 dtmpz=0 do 10 igr=1,ngrl1 dtmpx=dtmpx+dmx(igrlv1(il1,igr))*v(igrlv1(il1,igr)) dtmpy=dtmpy+dmy(igrlv1(il1,igr))*v(igrlv1(il1,igr)) dtmpz=dtmpz+dmz(igrlv1(il1,igr))*v(igrlv1(il1,igr)) 10 continue dm1x(il1)=dtmpx dm1y(il1)=dtmpy dm1z(il1)=dtmpz 20 continue #ifdef CL_OS_KSR_OS_PAR c*ksr* end tile #endif call hrtoff(9) c then do level 2 as a sum of level 1 elements c the array l2bll1 contains the element number of the level 1 element in c the lower left corner of the level 2 element. for all the nl1l2*nl1l2 c level 1 elements in each level 2 element, calculate the offset of the c element number from this bottom left element. call hrton(9) C$DOACROSS LOCAL(ol2,dtmpx,dtmpy,dtmpz,il1ofs,l2bll1), C$& SHARE(dm1x,dm1y,dm1z,dm2x,dm2y,dm2z) #ifdef CL_OS_KSR_OS_PAR nthreads=number_of_threads()



itile=(nl2+nthreads-1)/nthreads c*ksr* tile (il2,strategy=slice,tilesize=(il2:itile), c*ksr*& teamid=nmaint, c*ksr*& private=(ol2,dtmpx,dtmpy,dtmpz,il1ofs,l2bll1)) #endif do 30 il2=1,nl2 dtmpx=0 dtmpy=0 dtmpz=0 l2bll1=nl1l2s*(il2-1) do 35 il1ofs=1,nl1l2s c then sum the contributions to the level 2 elements dtmpx=dtmpx+dm1x(l2bll1+il1ofs) dtmpy=dtmpy+dm1y(l2bll1+il1ofs) dtmpz=dtmpz+dm1z(l2bll1+il1ofs) 35 continue dm2x(il2)=dtmpx dm2y(il2)=dtmpy dm2z(il2)=dtmpz 30 continue #ifdef CL_OS_KSR_OS_PAR c*ksr* end tile #endif call hrtoff(9) c endif #endif call hrton(4) #ifdef CL_OS_KSR_OS_PMON call pmonon(4) #endif call hrton(9) C$DOACROSS LOCAL(dtmpx,dtmpy,dtmpz,igr, C$& dspmk,dhmag, C$& jnb, C$& dmxij,dmyij,dmzij,dspij,jgr, C$& jl1, C$& jl2, C$& dsth,dcth,dsph,dcph,dhth,dhph), C$& SHARE(rd11,rd12,rd13,rd21,rd22,rd23,rd31,rd32,rd33, C$& dmx,dmy,dmz,hxext,hyext,hzext, C$& rkx,rky,rkz,dk2f1, C$& nlistx,exmult, C$& nliste,xij,yij,zij,fij, C$& nlist1,xij1,yij1,zij1,fij1, C$& xij2,yij2,zij2,fij2, C$& dotm,dhxt,dhyt,dhzt) #ifdef CL_OS_KSR_OS_PAR nthreads=number_of_threads() itile=(ngr+nthreads-1)/nthreads c*ksr* tile (igr,strategy=slice,tilesize=(igr:itile), c*ksr*& teamid=nmaint, c*ksr*& private=(dtmpx,dtmpy,dtmpz,igr, c*ksr*& dspmk,dhmag, c*ksr*& jnb, c*ksr*& dmxij,dmyij,dmzij,dspij,jgr, c*ksr*& jl1, c*ksr*& jl2, c*ksr*& dsth,dcth,dsph,dcph,dhth,dhph)) #endif do 5000 igr=1,ngr c calculate total field ht c add together external field, shape and crystal anisotropy (if present) dtmpx=dble(hxext(igr)) + +dble(rd11(igr))*dmx(igr) + +dble(rd12(igr))*dmy(igr) + +dble(rd13(igr))*dmz(igr)



dtmpy=dble(hyext(igr)) +dble(rd21(igr))*dmx(igr) +dble(rd22(igr))*dmy(igr) +dble(rd23(igr))*dmz(igr) dtmpz=dble(hzext(igr)) + +dble(rd31(igr))*dmx(igr) + +dble(rd32(igr))*dmy(igr) + +dble(rd33(igr))*dmz(igr) c effective field due to k2 #ifdef CL_USE_QK2 c if(qk2inc)then dspmk=rkx(igr)*dmx(igr) + +rky(igr)*dmy(igr) + +rkz(igr)*dmz(igr) dhmag=1.0-dspmk*dspmk*dk2f1 dhmag=dhmag*dspmk*dk2f2 dtmpx=dtmpx+dhmag*rkx(igr) dtmpy=dtmpy+dhmag*rky(igr) dtmpz=dtmpz+dhmag*rkz(igr) c endif #endif c exchange contribution. #ifdef CL_USE_EXCHANGE c if(qexchg)then do 3000 jnb=1,nexgnb dtmpx=dtmpx+dmx(nlistx(jnb,igr))*dble(exmult(jnb,igr)) dtmpy=dtmpy+dmy(nlistx(jnb,igr))*dble(exmult(jnb,igr)) dtmpz=dtmpz+dmz(nlistx(jnb,igr))*dble(exmult(jnb,igr)) 3000 continue c endif #endif c dipole interaction contribution :c firstly - exact calculation of neighb nearest neighbours. #ifdef CL_USE_DIPOLE c if(qdipol)then do 400 jgr=1,neighb dmxij=dmx(nliste(jgr,igr)) dmyij=dmy(nliste(jgr,igr)) dmzij=dmz(nliste(jgr,igr)) c scalar product of m and rij. n.b. yij = 0.0, so don't bother with my dspij=dmxij*dble(xij(jgr,igr)) + +dmyij*dble(yij(jgr,igr)) + +dmzij*dble(zij(jgr,igr)) c note that fij is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used. dtmpx=dtmpx+ + +dble(fij(jgr,igr)) + *(dspij*dble(xij(jgr,igr))-dmxij) dtmpy=dtmpy+ + +dble(fij(jgr,igr)) + *(dspij*dble(yij(jgr,igr))-dmyij) dtmpz=dtmpz+ + +dble(fij(jgr,igr)) + *(dspij*dble(zij(jgr,igr))-dmzij) 400 continue c secondly - nearest level 1 elements. do 1400 jl1=1,nl1dim dmxij=dm1x(nlist1(jl1,igr)) dmyij=dm1y(nlist1(jl1,igr)) dmzij=dm1z(nlist1(jl1,igr)) dspij=dmxij*dble(xij1(jl1,igr)) + +dmyij*dble(yij1(jl1,igr)) + +dmzij*dble(zij1(jl1,igr)) c note that fij1 is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. + + +



c hence field calculations are the same whichever b.c.'s are used. dtmpx=dtmpx + +dble(fij1(jl1,igr)) + *(dspij*dble(xij1(jl1,igr))-dmxij) dtmpy=dtmpy + +dble(fij1(jl1,igr)) + *(dspij*dble(yij1(jl1,igr))-dmyij) dtmpz=dtmpz + +dble(fij1(jl1,igr)) + *(dspij*dble(zij1(jl1,igr))-dmzij) 1400 continue c dipole interaction contribution :c finally the remainder of the film, using level 2 elements. c since 75% of the film is covered this way, do not select the c contributing components, just sum over all level 2 elements for best c vectorization rate. the elements which should not contribute have been c taken care of in extfac, by setting fij = xij = zij = 0.0 do 2400 jl2=1,nl2 dspij=dm2x(jl2)*dble(xij2(jl2,igr)) + +dm2y(jl2)*dble(yij2(jl2,igr)) + +dm2z(jl2)*dble(zij2(jl2,igr)) c note that fij2 is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used. dtmpx=dtmpx + +dble(fij2(jl2,igr)) * *(dspij*dble(xij2(jl2,igr))-dm2x(jl2)) dtmpy=dtmpy + +dble(fij2(jl2,igr)) * *(dspij*dble(yij2(jl2,igr))-dm2y(jl2)) dtmpz=dtmpz + +dble(fij2(jl2,igr)) * *(dspij*dble(zij2(jl2,igr))-dm2z(jl2)) 2400 continue c endif #endif c calculate dm/dt from landau - lifschitz - gilbert eqn. dsth=sin(dmthp(igr,1)) dcth=cos(dmthp(igr,1)) dsph=sin(dmthp(igr,2)) dcph=cos(dmthp(igr,2)) dhth=dtmpx*dcth*dcph+dtmpz*dcth*dsph-dtmpy*dsth dhph=dtmpz*dcph-dtmpx*dsph dotm(igr,1)=dgfct2*dhth-dgfct1*dhph dotm(igr,2)=(dgfct2*dhph+dgfct1*dhth)/dsth dhxt(igr)=dtmpx dhyt(igr)=dtmpy dhzt(igr)=dtmpz 5000 continue #ifdef CL_OS_KSR_OS_PAR c*ksr* end tile #endif #ifdef CL_OS_KSR_OS_ELOG call elog_log(101,0) #endif call hrtoff(9) call hrtoff(4) #ifdef CL_OS_KSR_OS_PMON call pmonoff(4) #endif call hrtoff(3) #ifdef CL_OS_KSR_OS_PMON call pmonoff(3) #endif return end


A.4 BSP Version


subroutine dmdt(myid,ngrcopy,dtime,dmthp,dotm) #include "cluster_system.h" #include "cluster.h" integer njunk,myid dimension dmthp(ngrppc,2) dimension dotm(ngrppc,2) c arrays for mx, my, mz for the individual grains c n.b. grain number 0 is a fictional grain with no moment, to simplify c the use of incomplete neighbour lists. c dimension dmx(0:ngr),dmy(0:ngr),dmz(0:ngr) c moments of the level 1 and 2 elements c dimension dm1x(nl1),dm1y(nl1),dm1z(nl1) c dimension dm2x(jnl2),dm2y(jnl2),dm2z(jnl2) integer itile,nthreads,istart,iend,ileng external hrton,hrtoff #ifdef CL_OS_KSR_OS_PAR external pthread_barrier_checkout,pthread_barrier_checkin #endif #ifdef CL_OS_KSR_OS_PMON external pmonon, pmonoff #endif nthreads=number_of_threads() c do timer stuff if(myid.eq.0)call hrton(3) #ifdef CL_OS_KSR_OS_PMON if(myid.eq.0)call pmonon(3) #endif dmx(0)=0. dmy(0)=0. dmz(0)=0. c increment counter nnumit=nnumit+1 c initialise fictional grains c calculate x,y,z components of m call bspsync() itile=nl2gr*((nl2+nthreads-1)/nthreads) istart=1+myid*itile iend=min((myid+1)*itile,ngr) do 200 igr=istart,iend igrmod=igr-istart+1 drsin=dble(rmsat)*sin(dmthp(igrmod,1)) dmy(igr)=dble(rmsat)*cos(dmthp(igrmod,1)) dmx(igr)=drsin*cos(dmthp(igrmod,2)) dmz(igr)=drsin*sin(dmthp(igrmod,2)) do 205 i=1,nthreads iproc=nproce(igrmod,i) if(iproc.eq.0)go to 200 call bsphpput(iproc-1,dmx(igr),dmx,dblsiz*igr,dblsiz) call bsphpput(iproc-1,dmy(igr),dmy,dblsiz*igr,dblsiz) call bsphpput(iproc-1,dmz(igr),dmz,dblsiz*igr,dblsiz) 205 continue 200 continue c calculate the total moments for the grid elements c start with level 1, using the element member list igrlv1 c only required for dipole calculations #ifdef CL_USE_DIPOLE c if(qdipol)then



itile=nl1l2s*((nl2+nthreads-1)/nthreads) istart=1+myid*itile iend=min((myid+1)*itile,nl1) do 20 il1=istart,iend il1mod=il1-istart+1 dtmpx=0 dtmpy=0 dtmpz=0 do 10 igr=1,ngrl1 dtmpx=dtmpx+dmx(igrlv1(il1mod,igr)) * *v(igrlv1(il1mod,igr)) dtmpy=dtmpy+dmy(igrlv1(il1mod,igr)) * *v(igrlv1(il1mod,igr)) dtmpz=dtmpz+dmz(igrlv1(il1mod,igr)) * *v(igrlv1(il1mod,igr)) 10 continue dm1x(il1)=dtmpx dm1y(il1)=dtmpy dm1z(il1)=dtmpz do 15 i=1,nthreads iproc=nproc1(il1mod,i) if(iproc.eq.0)go to 20 call bsphpput(iproc-1,dm1x(il1),dm1x, , dblsiz*(il1-1),dblsiz) call bsphpput(iproc-1,dm1y(il1),dm1y, , dblsiz*(il1-1),dblsiz) call bsphpput(iproc-1,dm1z(il1),dm1z, , dblsiz*(il1-1),dblsiz) 15 continue 20 continue c then do level 2 as a sum of level 1 elements c the array l2bll1 contains the element number of the level 1 element in c the lower left corner of the level 2 element. for all the nl1l2*nl1l2 c level 1 elements in each level 2 element, calculate the offset of the c element number from this bottom left element. itile=(nl2+nthreads-1)/nthreads istart=1+myid*itile iend=min((myid+1)*itile,nl2) do 30 il2=istart,iend dtmpx=0 dtmpy=0 dtmpz=0 l2bll1=nl1l2s*(il2-1) do 35 il1ofs=1,nl1l2s c then sum the contributions to the level 2 elements dtmpx=dtmpx+dm1x(l2bll1+il1ofs) dtmpy=dtmpy+dm1y(l2bll1+il1ofs) dtmpz=dtmpz+dm1z(l2bll1+il1ofs) 35 continue dm2x(il2)=dtmpx dm2y(il2)=dtmpy dm2z(il2)=dtmpz 30 continue do 33 i=0,nthreads-1 if(i.ne.myid)then call bsphpput(i,dm2x(istart),dm2x,dblsiz*(istart-1), , (iend-istart+1)*dblsiz) call bsphpput(i,dm2y(istart),dm2y,dblsiz*(istart-1), , (iend-istart+1)*dblsiz) call bsphpput(i,dm2z(istart),dm2z,dblsiz*(istart-1), , (iend-istart+1)*dblsiz) endif 33 continue c endif #endif call bspsync()



if(myid.eq.0)then call hrton(4) #ifdef CL_OS_KSR_OS_PMON call pmonon(4) #endif call hrton(9) endif itile=nl2gr*((nl2+nthreads-1)/nthreads) istart=1+myid*itile iend=min((myid+1)*itile,ngr) if (istart.le.iend) then do 5000 igr=istart,iend igrmod=igr-istart+1 c calculate total field ht c add together external field, shape and crystal anisotropy (if present) dtmpx=dble(hxext(igrmod)) + +dble(rd11(igrmod))*dmx(igr) + +dble(rd12(igrmod))*dmy(igr) + +dble(rd13(igrmod))*dmz(igr) dtmpy=dble(hyext(igrmod)) + +dble(rd21(igrmod))*dmx(igr) + +dble(rd22(igrmod))*dmy(igr) + +dble(rd23(igrmod))*dmz(igr) dtmpz=dble(hzext(igrmod)) + +dble(rd31(igrmod))*dmx(igr) + +dble(rd32(igrmod))*dmy(igr) + +dble(rd33(igrmod))*dmz(igr) c effective field due to k2 #ifdef CL_USE_QK2 c if(qk2inc)then dspmk=rkx(igrmod)*dmx(igr) + +rky(igrmod)*dmy(igr) + +rkz(igrmod)*dmz(igr) dhmag=1.0-dspmk*dspmk*dk2f1 dhmag=dhmag*dspmk*dk2f2 dtmpx=dtmpx+dhmag*rkx(igrmod) dtmpy=dtmpy+dhmag*rky(igrmod) dtmpz=dtmpz+dhmag*rkz(igrmod) c endif #endif c exchange contribution. #ifdef CL_USE_EXCHANGE c if(qexchg)then do 3000 jnb=1,nexgnb dtmpx=dtmpx+ + dmx(nlistx(jnb,igrmod))* * dble(exmult(jnb,igrmod)) dtmpy=dtmpy+ + dmy(nlistx(jnb,igrmod))* * dble(exmult(jnb,igrmod)) dtmpz=dtmpz+ + dmz(nlistx(jnb,igrmod))* * dble(exmult(jnb,igrmod)) 3000 continue c endif #endif c dipole interaction contribution :c firstly - exact calculation of neighb nearest neighbours. #ifdef CL_USE_DIPOLE c if(qdipol)then do 400 jgr=1,neighb dmxij=dmx(nliste(jgr,igrmod)) dmyij=dmy(nliste(jgr,igrmod)) dmzij=dmz(nliste(jgr,igrmod)) c scalar product of m and rij. n.b. yij = 0.0, so don't bother with my dspij=dmxij*dble(xij(jgr,igrmod))



+ +dmyij*dble(yij(jgr,igrmod)) + +dmzij*dble(zij(jgr,igrmod)) c note that fij is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used. dtmpx=dtmpx + +dble(fij(jgr,igrmod)) + *(dspij*dble(xij(jgr,igrmod))-dmxij) dtmpy=dtmpy + +dble(fij(jgr,igrmod)) + *(dspij*dble(yij(jgr,igrmod))-dmyij) dtmpz=dtmpz + +dble(fij(jgr,igrmod)) + *(dspij*dble(zij(jgr,igrmod))-dmzij) 400 continue c secondly - nearest level 1 elements. do 1400 jl1=1,nl1dim dmxij=dm1x(nlist1(jl1,igrmod)) dmyij=dm1y(nlist1(jl1,igrmod)) dmzij=dm1z(nlist1(jl1,igrmod)) dspij=dmxij*dble(xij1(jl1,igrmod)) + +dmyij*dble(yij1(jl1,igrmod)) + +dmzij*dble(zij1(jl1,igrmod)) c note that fij1 is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used. dtmpx=dtmpx + +dble(fij1(jl1,igrmod)) + *(dspij*dble(xij1(jl1,igrmod))-dmxij) dtmpy=dtmpy + +dble(fij1(jl1,igrmod)) + *(dspij*dble(yij1(jl1,igrmod))-dmyij) dtmpz=dtmpz + +dble(fij1(jl1,igrmod)) + *(dspij*dble(zij1(jl1,igrmod))-dmzij) 1400 continue c dipole interaction contribution :c finally the remainder of the film, using level 2 elements. c since 75% of the film is covered this way, do not select the c contributing components, just sum over all level 2 elements for best c vectorization rate. the elements which should not contribute have been c taken care of in extfac, by setting fij = xij = zij = 0.0 do 2400 jl2=1,nl2 dspij=dm2x(jl2) * *dble(xij2(jl2,igrmod)) + +dm2y(jl2) * *dble(yij2(jl2,igrmod)) + +dm2z(jl2) * *dble(zij2(jl2,igrmod)) c note that fij2 is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used. dtmpx=dtmpx + +dble(fij2(jl2,igrmod)) * *(dspij*dble(xij2(jl2,igrmod)) -dm2x(jl2)) dtmpy=dtmpy + +dble(fij2(jl2,igrmod)) * *(dspij*dble(yij2(jl2,igrmod)) -dm2y(jl2)) dtmpz=dtmpz + +dble(fij2(jl2,igrmod)) * *(dspij*dble(zij2(jl2,igrmod)) -dm2z(jl2)) 2400 continue c endif



#endif c calculate dm/dt from landau - lifschitz - gilbert eqn. dsth=sin(dmthp(igrmod,1)) dcth=cos(dmthp(igrmod,1)) dsph=sin(dmthp(igrmod,2)) dcph=cos(dmthp(igrmod,2)) dhth=dtmpx*dcth*dcph+dtmpz*dcth*dsph-dtmpy*dsth dhph=dtmpz*dcph-dtmpx*dsph dotm(igrmod,1)=dgfct2 * *dhth-dgfct1*dhph dotm(igrmod,2)=(dgfct2* * dhph+dgfct1*dhth)/dsth dhxt(igrmod)=dtmpx dhyt(igrmod)=dtmpy dhzt(igrmod)=dtmpz 5000 continue endif if(myid.eq.0)call hrtoff(4) #ifdef CL_OS_KSR_OS_PMON if(myid.eq.0)call pmonoff(4) #endif if(myid.eq.0)call hrtoff(3) #ifdef CL_OS_KSR_OS_PMON if(myid.eq.0)call pmonoff(3) #endif return end


A.5 MPI Version


subroutine dmdt(myid,ngrcopy,dtime,dmthp,dotm) #include #include #include #include "cluster.h.verytop" "mpif.h" "cluster_system.h" "cluster.h"

integer njunk,myid,reqid,mpistat,mpistats,requests,nexcep dimension dmthp(ngrppc,2) dimension dotm(ngrppc,2) dimension dgrxyz(3) dimension requests(ngr) dimension mpistat(MPI_STATUS_SIZE) dimension mpistats(MPI_STATUS_SIZE,ngr) c arrays for mx, my, mz for the individual grains c n.b. grain number 0 is a fictional grain with no moment, to simplify c the use of incomplete neighbour lists. c dimension dmx(0:ngr),dmy(0:ngr),dmz(0:ngr) c moments of the level 1 and 2 elements c dimension dm1x(nl1),dm1y(nl1),dm1z(nl1) c dimension dm2x(jnl2),dm2y(jnl2),dm2z(jnl2) integer itile,nthreads,istart,iend,ileng external hrton,hrtoff #ifdef CL_OS_KSR_OS_PAR external pthread_barrier_checkout,pthread_barrier_checkin #endif #ifdef CL_OS_KSR_OS_PMON external pmonon, pmonoff #endif call mpi_comm_size(MPI_COMM_WORLD,nthreads,ierr) call check_mpi(ierr,300) c do timer stuff



if(myid.eq.0)call hrton(3) #ifdef CL_OS_KSR_OS_PMON if(myid.eq.0)call pmonon(3) #endif dmx(0)=0. dmy(0)=0. dmz(0)=0. c increment counter nnumit=nnumit+1 c initialise fictional grains c calculate x,y,z components of m itile=nl2gr*((nl2+nthreads-1)/nthreads) istart=1+myid*itile iend=min((myid+1)*itile,ngr) reqid=0 do 200 igr=istart,iend igrmod=igr-istart+1 drsin=dble(rmsat)*sin(dmthp(igrmod,1)) dmy(igr)=dble(rmsat)*cos(dmthp(igrmod,1)) dmx(igr)=drsin*cos(dmthp(igrmod,2)) dmz(igr)=drsin*sin(dmthp(igrmod,2)) do 205 i=1,nthreads iproc=nproce(igrmod,i) if(iproc.eq.0)go to 200 call gadd(myid,iproc-1,igr,dmx(igr),dmy(igr),dmz(igr)) 205 continue 200 continue call gbcast(myid,nthreads,0) c calculate the total moments for the grid elements c start with level 1, using the element member list igrlv1 c only required for dipole calculations #ifdef CL_USE_DIPOLE c if(qdipol)then itile=nl1l2s*((nl2+nthreads-1)/nthreads) istart=1+myid*itile iend=min((myid+1)*itile,nl1) reqid=0 do 20 il1=istart,iend il1mod=il1-istart+1 dtmpx=0 dtmpy=0 dtmpz=0 do 10 igr=1,ngrl1 dtmpx=dtmpx+dmx(igrlv1(il1mod,igr)) * *v(igrlv1(il1mod,igr)) dtmpy=dtmpy+dmy(igrlv1(il1mod,igr)) * *v(igrlv1(il1mod,igr)) dtmpz=dtmpz+dmz(igrlv1(il1mod,igr)) * *v(igrlv1(il1mod,igr)) 10 continue dm1x(il1)=dtmpx dm1y(il1)=dtmpy dm1z(il1)=dtmpz do 15 i=1,nthreads iproc=nproc1(il1mod,i) if(iproc.eq.0)go to 20 call gadd(myid,iproc-1,il1, , dm1x(il1),dm1y(il1),dm1z(il1)) 15 continue 20 continue call gbcast(myid,nthreads,1) c calculate the total moments for the grid elements c then do level 2 as a sum of level 1 elements c the array l2bll1 contains the element number of the level 1 element in c the lower left corner of the level 2 element. for all the nl1l2*nl1l2 c level 1 elements in each level 2 element, calculate the offset of the



c element number from this bottom left element. itile=(nl2+nthreads-1)/nthreads istart=1+myid*itile iend=min((myid+1)*itile,nl2) do 30 il2=istart,iend dtmpx=0 dtmpy=0 dtmpz=0 l2bll1=nl1l2s*(il2-1) do 35 il1ofs=1,nl1l2s c then sum the contributions to the level 2 elements dtmpx=dtmpx+dm1x(l2bll1+il1ofs) dtmpy=dtmpy+dm1y(l2bll1+il1ofs) dtmpz=dtmpz+dm1z(l2bll1+il1ofs) 35 continue dm2x(il2)=dtmpx dm2y(il2)=dtmpy dm2z(il2)=dtmpz 30 continue call gbcast(myid,nthreads,2) c endif #endif if(myid.eq.0)then call hrton(4) #ifdef CL_OS_KSR_OS_PMON call pmonon(4) #endif call hrton(9) endif itile=nl2gr*((nl2+nthreads-1)/nthreads) istart=1+myid*itile iend=min((myid+1)*itile,ngr) if (istart.le.iend) then do 5000 igr=istart,iend igrmod=igr-istart+1 c calculate total field ht c add together external field, shape and crystal anisotropy (if present) dtmpx=dble(hxext(igrmod)) + +dble(rd11(igrmod))*dmx(igr) + +dble(rd12(igrmod))*dmy(igr) + +dble(rd13(igrmod))*dmz(igr) dtmpy=dble(hyext(igrmod)) + +dble(rd21(igrmod))*dmx(igr) + +dble(rd22(igrmod))*dmy(igr) + +dble(rd23(igrmod))*dmz(igr) dtmpz=dble(hzext(igrmod)) + +dble(rd31(igrmod))*dmx(igr) + +dble(rd32(igrmod))*dmy(igr) + +dble(rd33(igrmod))*dmz(igr) c effective field due to k2 #ifdef CL_USE_QK2 c if(qk2inc)then dspmk=rkx(igrmod)*dmx(igr) + +rky(igrmod)*dmy(igr) + +rkz(igrmod)*dmz(igr) dhmag=1.0-dspmk*dspmk*dk2f1 dhmag=dhmag*dspmk*dk2f2 dtmpx=dtmpx+dhmag*rkx(igrmod) dtmpy=dtmpy+dhmag*rky(igrmod) dtmpz=dtmpz+dhmag*rkz(igrmod) c endif #endif c exchange contribution. #ifdef CL_USE_EXCHANGE c if(qexchg)then do 3000 jnb=1,nexgnb



+ * + * + * dtmpx=dtmpx+ dmx(nlistx(jnb,igrmod))* dble(exmult(jnb,igrmod)) dtmpy=dtmpy+ dmy(nlistx(jnb,igrmod))* dble(exmult(jnb,igrmod)) dtmpz=dtmpz+ dmz(nlistx(jnb,igrmod))* dble(exmult(jnb,igrmod)) continue endif


3000 c #endif c dipole interaction contribution :c firstly - exact calculation of neighb nearest neighbours. #ifdef CL_USE_DIPOLE c if(qdipol)then do 400 jgr=1,neighb dmxij=dmx(nliste(jgr,igrmod)) dmyij=dmy(nliste(jgr,igrmod)) dmzij=dmz(nliste(jgr,igrmod)) c scalar product of m and rij. n.b. yij = 0.0, so don't bother with my dspij=dmxij*dble(xij(jgr,igrmod)) + +dmyij*dble(yij(jgr,igrmod)) + +dmzij*dble(zij(jgr,igrmod)) c note that fij is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used. dtmpx=dtmpx + +dble(fij(jgr,igrmod)) + *(dspij*dble(xij(jgr,igrmod))-dmxij) dtmpy=dtmpy + +dble(fij(jgr,igrmod)) + *(dspij*dble(yij(jgr,igrmod))-dmyij) dtmpz=dtmpz + +dble(fij(jgr,igrmod)) + *(dspij*dble(zij(jgr,igrmod))-dmzij) 400 continue c secondly - nearest level 1 elements. do 1400 jl1=1,nl1dim dmxij=dm1x(nlist1(jl1,igrmod)) dmyij=dm1y(nlist1(jl1,igrmod)) dmzij=dm1z(nlist1(jl1,igrmod)) dspij=dmxij*dble(xij1(jl1,igrmod)) + +dmyij*dble(yij1(jl1,igrmod)) + +dmzij*dble(zij1(jl1,igrmod)) c note that fij1 is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used. dtmpx=dtmpx + +dble(fij1(jl1,igrmod)) + *(dspij*dble(xij1(jl1,igrmod))-dmxij) dtmpy=dtmpy + +dble(fij1(jl1,igrmod)) + *(dspij*dble(yij1(jl1,igrmod))-dmyij) dtmpz=dtmpz + +dble(fij1(jl1,igrmod)) + *(dspij*dble(zij1(jl1,igrmod))-dmzij) 1400 continue c dipole interaction contribution :c finally the remainder of the film, using level 2 elements. c since 75% of the film is covered this way, do not select the c contributing components, just sum over all level 2 elements for best c vectorization rate. the elements which should not contribute have been c taken care of in extfac, by setting fij = xij = zij = 0.0 do 2400 jl2=1,nl2 dspij=dm2x(jl2)



* *dble(xij2(jl2,igrmod)) + +dm2y(jl2) * *dble(yij2(jl2,igrmod)) + +dm2z(jl2) * *dble(zij2(jl2,igrmod)) c note that fij2 is always positive for symmetric b.c.'s, c but has significant sign for antisymmetric b.c.'s. c hence field calculations are the same whichever b.c.'s are used. dtmpx=dtmpx + +dble(fij2(jl2,igrmod)) * *(dspij*dble(xij2(jl2,igrmod)) -dm2x(jl2)) dtmpy=dtmpy + +dble(fij2(jl2,igrmod)) * *(dspij*dble(yij2(jl2,igrmod)) -dm2y(jl2)) dtmpz=dtmpz + +dble(fij2(jl2,igrmod)) * *(dspij*dble(zij2(jl2,igrmod)) -dm2z(jl2)) 2400 continue #endif c calculate dm/dt from landau - lifschitz - gilbert eqn. dsth=sin(dmthp(igrmod,1)) dcth=cos(dmthp(igrmod,1)) dsph=sin(dmthp(igrmod,2)) dcph=cos(dmthp(igrmod,2)) dhth=dtmpx*dcth*dcph+dtmpz*dcth*dsph-dtmpy*dsth dhph=dtmpz*dcph-dtmpx*dsph dotm(igrmod,1)=dgfct2 * *dhth-dgfct1*dhph dotm(igrmod,2)=(dgfct2* * dhph+dgfct1*dhth)/dsth dhxt(igrmod)=dtmpx dhyt(igrmod)=dtmpy dhzt(igrmod)=dtmpz 5000 continue endif if(myid.eq.0)call hrtoff(4) #ifdef CL_OS_KSR_OS_PMON if(myid.eq.0)call pmonoff(4) #endif if(myid.eq.0)call hrtoff(3) #ifdef CL_OS_KSR_OS_PMON if(myid.eq.0)call pmonoff(3) #endif return end


subroutine grinit(myid) #include #include #include #include #include "cluster.h.verytop" "mpif.h" "cluster_system.h" "cluster.h" "cluster_mpi.h"

integer myid do 5 i=1,CLUSTER_NUM_THREADS gcount(i)=0 5 continue return end



subroutine gbcast(myid,nthreads,glevel) #include #include #include #include #include "cluster.h.verytop" "mpif.h" "cluster_system.h" "cluster.h" "cluster_mpi.h"


integer myid,ierr,reqid,igr,il1,gtag,i,iproc,glevel,nthreads integer mpistat,mpistats,requests integer sproc,itile,istart,iend,idst,nrecs,rbuf integer istart2,iend2,idst2 dimension requests(CLUSTER_NUM_THREADS) dimension mpistat(MPI_STATUS_SIZE) dimension mpistats(MPI_STATUS_SIZE,CLUSTER_NUM_THREADS) reqid=0 if(glevel.eq.2)then itile=((nl2+nthreads-1)/nthreads) istart=1+myid*itile iend=min((myid+1)*itile,nl2) idst=iend-istart+1 c nrecs is the number of processors which have level two elements nrecs=(nl2+itile-1)/itile rbuf=mod(myid+1,CLUSTER_NUM_THREADS)+1 if (idst.gt.0) then call dcopy(idst,dm2x(istart),1,l2buff(1,rbuf),1) call dcopy(idst,dm2y(istart),1,l2buff(idst+1,rbuf),1) call dcopy(idst,dm2z(istart),1,l2buff(2*idst+1,rbuf),1) c if we are a processor with level two elements - we want to only receive c from nrecs-1 nrecs=nrecs-1 endif else nrecs=nrecve endif do 10 i=0,CLUSTER_NUM_THREADS-1 if ((i.ne.myid).and.((glevel.eq.2).or.(gcount(i+1).gt.1)))then if(glevel.eq.2)then gtag=ngr+nl1+myid+1 if(idst.gt.0)then reqid=reqid+1 call mpi_isend , (l2buff(1,rbuf),idst*3,MPI_DOUBLE_PRECISION, , i,gtag,MPI_COMM_WORLD,requests(reqid),ierr) call check_mpi(ierr,9901) endif else reqid=reqid+1 if(glevel.eq.1)then gtag=gcount(i+1)+ngr else gtag=gcount(i+1) endif call mpi_isend , (buffer(1,1,i+1),gcount(i+1)*4,MPI_DOUBLE_PRECISION, , i,gtag,MPI_COMM_WORLD,requests(reqid),ierr) call check_mpi(ierr,9901) gcount(i+1)=0 endif endif 10 continue



do 20 iproc=1,nrecs call mpi_recv , (buffer(1,1,myid+1),ngr*4,MPI_DOUBLE_PRECISION, , MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,mpistat,ierr) call check_mpi(ierr,9902) gtotal=mpistat(MPI_TAG) if(gtotal.gt.(ngr+nl1))then sproc=gtotal-ngr-nl1-1 itile=((nl2+nthreads-1)/nthreads) istart2=1+sproc*itile iend2=min((sproc+1)*itile,nl2) idst2=iend2-istart2+1 call dcopy(idst2,l2buff(1,myid+1),1,dm2x(istart2),1) call dcopy(idst2,l2buff(idst2+1,myid+1),1,dm2y(istart2),1) call dcopy(idst2,l2buff(2*idst2+1,myid+1),1,dm2z(istart2),1) else if(gtotal.gt.ngr)then do 30 i=1,gtotal-ngr il1=grindx(1,i,myid+1) dm1x(il1)=buffer(2,i,myid+1) dm1y(il1)=buffer(3,i,myid+1) dm1z(il1)=buffer(4,i,myid+1) 30 continue else do 40 i=1,gtotal igr=grindx(1,i,myid+1) dmx(igr)=buffer(2,i,myid+1) dmy(igr)=buffer(3,i,myid+1) dmz(igr)=buffer(4,i,myid+1) 40 continue endif endif 20 continue call mpi_waitall(reqid,requests,mpistats,ierr) call check_mpi(ierr,9903) return end


subroutine gadd(myid,pdest,igr,dgrx,dgry,dgrz) #include #include #include #include #include "cluster.h.verytop" "mpif.h" "cluster_system.h" "cluster.h" "cluster_mpi.h"

integer myid,pdest,igr real*8 dgrx,dgry,dgrz gcount(pdest+1)=gcount(pdest+1)+1 grindx(1,gcount(pdest+1),pdest+1)=igr buffer(2,gcount(pdest+1),pdest+1)=dgrx buffer(3,gcount(pdest+1),pdest+1)=dgry buffer(4,gcount(pdest+1),pdest+1)=dgrz return end

Appendix B Execution Times


This appendix contains the execution times for the three parallel codes. Times are provided for all versions of each code, using the naming scheme which was introduced in Chapter 5 to refer to the different versions. In all cases, execution times are provided for problem sizes 4x4, 7x7 and 8x8, the problem sizes selected in Section 6.2.1.

B.1 SGI Challenge

B.1.1 Shared Memory

SM-ONEP

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1         10.18    34.05    53.18
       2          5.73    18.57    28.91
       3          3.96    13.53    20.61
       4          3.24    10.92    16.60

SM-ALLP

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1         10.06    34.28    52.61
       2          5.08    17.88    27.36
       3          3.52    12.34    18.59
       4          2.75     9.37    14.39


SM-ELOG

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1         10.09    34.31    53.05
       2          5.12    17.76    27.37
       3          3.50    12.04    18.80
       4          2.68     9.37    14.45

SM-PMON

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1          9.91    34.55    53.02
       2          5.03    17.88    27.46
       3          3.40    12.36    18.63
       4          2.57     9.59    14.50

B.1.2 Bulk Synchronous Processing


BSP-FIRST

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1          9.90    34.32    53.38
       2          5.14    18.05    25.83
       3          4.34    12.96    18.05
       4          4.58    11.37    14.66

BSP-TUNED

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1          9.89    34.29    53.12
       2          5.20    18.03    26.21
       3          4.40    13.09    17.96
       4          4.36    11.12    14.41


B.1.3 Message Passing Interface


MPI-FIRST

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1          9.94    34.90    53.23
       2          6.11    21.73    29.69
       3          6.01    17.24    23.68
       4          5.26    14.87    18.71

MPI-COMMS

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1         10.01    34.54    52.28
       2          4.84    17.62    25.27
       3          3.44    12.07    16.81
       4          2.33     9.39    12.87


B.2 KSR1

B.2.1 Shared Memory

SM-ONEP

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1         18.12    71.47   159.56
       2          9.80    39.75    80.36
       3          7.19    25.99    52.64
       4          5.83    20.07    41.92
       5          4.96    16.47    33.57
       6          4.48    15.09    28.95
       7          4.18    13.46    26.15
       8          3.80    12.57    23.38
       9          3.57    11.70    21.71
      10          3.36    10.92    20.42
      11          3.29    10.35    19.14
      12          3.14     9.99    18.06
      13          3.04     9.67    16.79
      14          3.01     9.15    16.47
      15          2.91     9.08    15.73
      16          2.83     8.76    15.19
      17          2.81     8.63    14.49
      18          2.77     8.35    13.87
      19          2.72     8.18    13.65
      20          2.71     8.05    13.26
      21          2.66     7.93    13.11
      22          2.74     7.82    12.79
      23          2.62     7.75    12.48
      24          2.58     7.69    12.17


SM-ALLP

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1         17.84    69.88   162.41
       2          9.60    32.18    70.33
       3          6.86    22.85    46.28
       4          5.31    16.95    35.81
       5          4.45    14.15    29.75
       6          3.66    11.56    24.12
       7          3.45    10.20    21.81
       8          3.07     9.36    18.45
       9          2.86     8.12    17.33
      10          2.83     7.80    15.78
      11          2.46     6.91    13.89
      12          2.50     6.73    12.70
      13          2.42     6.24    11.97
      14          2.31     5.77    11.08
      15          2.24     5.55    10.44
      16          2.20     5.35    10.17
      17          2.14     5.07     9.68
      18          2.06     4.95     9.09
      19          1.99     4.79     9.00
      20          2.02     4.78     8.58
      21          2.19     4.42     8.70
      22          2.18     4.54     8.14
      23          2.19     4.37     7.77
      24          1.90     4.27     7.50


SM-ELOG

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1         17.33    65.06   139.40
       2          9.12    30.57    60.22
       3          6.50    20.70    40.97
       4          5.12    15.74    30.48
       5          4.22    12.85    24.88
       6          3.65    11.00    20.85
       7          3.29     9.57    17.97
       8          2.91     8.54    16.02
       9          2.82     7.83    14.51
      10          2.54     7.15    13.08
      11          2.39     6.48    12.14
      12          2.31     6.11    11.15
      13          2.30     5.78    10.44
      14          2.22     5.59     9.90
      15          2.05     5.26     9.26
      16          2.03     5.06     8.92
      17          1.96     4.91     8.38
      18          1.86     4.61     7.99
      19          1.94     4.52     7.73
      20          1.83     4.37     7.38
      21          1.86     4.29     7.11
      22          1.91     4.39     6.92
      23          1.94     4.31     7.18
      24          1.79     3.99     6.94


SM-PMON

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1         16.89    68.29   141.66
       2          9.15    30.50    59.87
       3          6.51    20.77    40.51
       4          5.03    15.78    30.48
       5          4.21    12.92    24.93
       6          3.76    11.00    20.90
       7          3.26     9.48    17.94
       8          2.98     8.40    15.95
       9          2.79     7.72    14.38
      10          2.57     6.99    12.96
      11          2.50     6.58    11.90
      12          2.29     6.16    11.12
      13          2.24     6.04    10.43
      14          2.16     5.42     9.82
      15          2.31     5.10     9.30
      16          1.99     5.03     8.77
      17          2.29     4.75     8.31
      18          1.96     4.59     7.91
      19          1.93     4.33     7.64
      20          1.90     4.20     7.37
      21          1.87     4.07     7.19
      22          1.90     4.02     6.95
      23          1.93     3.96     6.85
      24          1.86     3.85     6.52

B.3 Cray T3D

B.3.1 Bulk Synchronous Processing

BSP-FIRST

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1          9.51    25.50    70.76
       2          5.24    14.36     N/A
       4          3.07     8.28    14.12
       8          1.74     4.86     7.44
      16          1.17     3.62     4.08
      32          1.22     2.22     2.36
      64          1.50     1.46     1.77


BSP-TUNED

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1          9.48    25.61    70.74
       2          5.18    14.36     N/A
       4          3.07     8.29    14.16
       8          1.83     4.88     7.43
      16          1.12     3.63     4.10
      32          1.22     2.12     2.36
      64          1.36     1.46     1.58

B.3.2 Message Passing Interface


MPI-FIRST

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1          9.49    25.25    70.28
       2          5.70    14.35    35.19
       4          3.61     9.01    15.37
       8          2.20     5.57     8.27
      16          1.74     4.47     4.76
      32          1.91     3.05     3.30
      64          1.88     2.90     3.08

MPI-COMMS

  Number of       Execution Times (s)
  Processors      4x4      7x7      8x8
       1          9.35    25.44    70.10
       2          4.83    13.06    33.79
       4          2.55     7.29    13.36
       8          1.37     4.10     6.86
      16          0.93     2.92     3.61
      32          1.00     1.62     2.03
      64          1.10     1.21     1.41

Bibliography
[Aamodt94] Aamodt A and Plaza E, Case-Based Reasoning: Foundational Issues, Methodological Variations and Systems Approaches, AICOM, Vol. 7, No. 1, March 1994.

[Adve95] Adve S V and Gharachorloo K, Shared Memory Consistency Models: A Tutorial, DEC Western Research Laboratories Research Report 95/7, September 1995.

[ANSI78] ANSI, Programming Language Fortran, American National Standard, X3.9-1978.

[ANSI92] ANSI, Programming Language Fortran 90, American National Standard, X3.198-1992.

[Barnes86] Barnes J and Hut P, A Hierarchical O(N log N) Force-Calculation Algorithm, Nature, Vol. 324, No. 4, December 1986, pp 446.

[Bodin93] Bodin F, Kervella L and Priol T, Fortran-S: A Fortran Interface for Shared Virtual Memory Architectures, Proceedings of Supercomputing '93, November 1993.

[Board95] Board J A, Hakura Z S, Elliot W D and Rankin W T, Scalable Variants of Multipole-based Algorithms for Molecular Dynamics Applications, Proceedings of the 7th SIAM Conference on Parallel Programming for Scientific Computing, 1995, pp 295-300.

[Brooks83] Brooks R E, Towards a Theory of the Comprehension of Computer Programs, International Journal of Man-Machine Studies, Vol. 18, 1983, pp 543-554.

[Bull96] Bull J M, A Hierarchical Classification of Overheads in Parallel Programs, Proceedings of the First IFIP TC10 International Workshop on Parallel and Distributed Software Engineering, Chapman and Hall, March 1996, pp 208-219.

[Burks46] Burks A W, Goldstine H H and von Neumann J, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, originally 1946, in [Taub61], pp 35-79.


[Cheatham94] Cheatham T, Fahmy A, Stefanescu D C and Valiant L G, Bulk Synchronous Parallel Computing - A Paradigm for Transportable Software, Technical Report TR-36-94, Center for Research in Computing Technology, Harvard University, Cambridge, Massachusetts, December 1994.

[CrayMPP] Cray Research Incorporated, Cray MPP Fortran Reference Manual, Cray Research Incorporated, Part No. SR-2504 6.2.

[Cripps87] Cripps M, Field T and Reeve M, An Introduction to ALICE: a Multiprocessor Graph Reduction Machine, in `Functional Programming: Languages, Tools and Architectures' (ed. Eisenbach S), Ellis Horwood, Market Cross House, Chichester, England.

[Culler93] Culler D et al, LogP: Towards a Realistic Model of Parallel Computation, Proceedings of the 4th ACM SIGPLAN Symposium on the Principles and Practice of Parallel Programming, May 1993, pp 1-12.

[Dijkstra68] Dijkstra E W, GOTO statement considered harmful, Communications of the ACM, Vol. 11, No. 3, 1968, pp 147-148.

[Dongarra79] Dongarra J and Hind A R, Unrolling Loops in Fortran, Software Practice and Experience, Vol. 9, No. 3, March 1979, pp 219-226.

[Fenton91] Fenton N E, Software Metrics: A Rigorous Approach, Chapman and Hall, New York, 1991.

[Ford95] Ford R and Poll D I A, A Parallel Processing Approach to Laminar Flow Control System Design, Scientific Programming, Vol. 4, No. 3, 1995.

[Fortune78] Fortune S and Wyllie J, Parallelism in Random Access Machines, Proceedings of the ACM Symposium on the Theory of Computing, 1978, pp 114-118.

[Foster95] Foster I, Designing and Building Parallel Programs, Addison Wesley, New York, 1995.

[Geist93] Geist G et al, PVM 3 User's Guide and Reference Manual, Technical Report ORNL/TM-12187, Oak Ridge National Laboratories, Oak Ridge, Tennessee, May 1993.

[Gerbessiotis92] Gerbessiotis A V and Valiant L G, Direct Bulk-Synchronous Parallel Algorithms, Technical Report TR-10-92, Center for Research in Computing Technology, Harvard University, Cambridge, Massachusetts, 1992.

[Goudreau95] Goudreau M W, Lang K, Rao S B and Tsantilas T, The Green BSP Library, Technical Report CS-TR-95-11, University of Central Florida, Orlando, 1995.

[Goudreau96] Goudreau M W et al, A Proposal for the BSP Worldwide Standard Library (preliminary version), Technical Report, Oxford University Computing Laboratory, Oxford, April 1996.

[Grady92] Grady R, Practical Software Metrics for Project Management and Process Improvement, Prentice-Hall, Englewood Cliffs, New Jersey, 1992.

[Greengard87] Greengard L and Rokhlin V, A Fast Algorithm for Particle Simulations, Journal of Computational Physics, Vol. 73, 1987, pp 325-348.

[Grunwald93] Grunwald D and Vajracharya S, Efficient Barriers for Distributed Shared Memory Computers, Technical Report CU-CS-703-94-93, Department of Computer Science, University of Colorado, September 1993.

[Hill96] Hill J M D, Crumpton P I and Burgess D A, The Theory, Practice, and a Tool for BSP Performance Prediction Applied to a CFD Application, Technical Report PRG-TR-4-1996, Oxford University Computing Laboratory, Oxford, 1996.

[HPFForum93] High Performance Fortran Forum, High Performance Fortran Language Specification, Scientific Programming, Vol. 2, Nos. 1 and 2, 1993.

[Holt95] Holt C and Singh J P, Hierarchical N-body Methods on Shared Address Space Multiprocessors, Proceedings of the 7th SIAM Conference on Parallel Programming for Scientific Computing, 1995, pp 313-318.

[Hu96] Hu Y and Johnsson S L, Implementing O(N) N-Body Algorithms Efficiently in Data-Parallel Languages, Scientific Programming, Vol. 5, No. 4, 1996.

[Hudak92] Hudak P et al, Report on the Functional Programming Language Haskell, Version 1.2, SIGPLAN Notices, Vol. 27, May 1992.

[Intel95] Intel Corporation, Pentium Processor Family Developer's Manual, Volume 3: Architecture and Programming Manual, Intel Corporation, Order No. 241430, 1995.

[Kernighan78] Kernighan B W and Ritchie D M, The C Programming Language, Prentice-Hall, Englewood Cliffs, New Jersey, 1978.

[Lanning94] Lanning A L and Khoshgoftaar T M, Modeling the Relationship Between Source Code Complexity and Maintenance Difficulty, IEEE Computer, Vol. 27, No. 9, September 1994, pp 35-40.

[Levelt92] Levelt W G, Kaashoek M F, Bal H E and Tanenbaum A S, A Comparison of Two Paradigms for Distributed Shared Memory, Software - Practice and Experience, Vol. 22, November 1992, pp 985-1010.
[Lilja94] Lilja D J, Exploiting the Parallelism Available in Loops, IEEE Computer, Vol. 27, No. 2, February 1994, pp 13-26.

[Miles91] Miles J J and Middleton B K, A Hierarchical Micromagnetic Model of Longitudinal Thin Film Recording Media, Journal of Magnetism and Magnetic Materials, Vol. 95, 1991, pp 99-108.

[Miller93] Miller R, A Library for Bulk Synchronous Parallel Programming, British Computer Society Parallel Processing Group, General Purpose Parallel Computing, December 1993.

[MPICHHome] MPICH World Wide Web Home Page, http://www.mcs.anl.gov/mpi/mpich.

[MPIForum94] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, International Journal of Supercomputer Applications and High Performance Computing, Vol. 8, Nos. 3 and 4, 1994.

[Murray93] Murray K, Osmon P E, Valsamidis A, Whitcroft A and Wilkinson T, Experiences with Distributed Shared Memory, Technical Report TCU/SARC/1993/3, Systems Architecture Research Centre, Department of Computer Science, City University, London, 1993.

[Neumann51] von Neumann J, The General and Logical Theory of Automata, originally 1951, in [Taub61].

[O'Boyle95] O'Boyle M F P, Kervella L and Bodin F, Synchronization Minimization in a SPMD Execution Model, Journal of Parallel and Distributed Computing, Vol. 29, 1995, pp 196-210.

[PCFForum90] Parallel Computing Forum (PCF), PCF Fortran Extensions - Draft Document, Revision 2.11, Kuck and Associates, 1906 Fox Drive, Champaign, Illinois 61820, March 1990.

[Peyton-Jones96a] Peyton Jones S L, Gordon A and Finne S, Concurrent Haskell, Proceedings of the 23rd ACM Symposium on Principles of Programming Languages, January 1996.

[Peyton-Jones96b] Peyton Jones S L, Compiling Haskell by Program Transformation: a Report from the Trenches, Proceedings of the European Symposium on Programming Languages, April 1996.

[Riley96] Riley G D, Techniques for Improving the Performance of Parallel Computations, MSc (II) Thesis, University of Manchester, October 1996.

[Russell78] Russell R M, The CRAY-1 Computer System, Communications of the ACM, Vol. 21, 1978, pp 63-72.
[Sakellariou96] Sakellariou R, On the Quest for Perfect Load Balance in Loop-Based Parallel Computations, PhD Thesis, Department of Computer Science, University of Manchester, October 1996.

[Sheil81] Sheil B A, The Psychological Study of Programming, ACM Computing Surveys, Vol. 13, No. 1, 1981, pp 101-120.

[Slotnick62] Slotnick D L, Borck W C and McReynolds R C, The SOLOMON Computer, AFIPS Conference Proceedings, Vol. 22, 1962, pp 97-107.

[Slotnick67] Slotnick D L, Unconventional Systems, AFIPS Conference Proceedings, Vol. 30, 1967, pp 477-481.

[Singh93] Singh J P, Parallel Hierarchical N-body Methods and their Implications for Multiprocessors, PhD Thesis, Stanford University, February 1993.

[Stark94] Stark G, Durst R C and Vowell C W, Using Metrics in Management Decision Making, IEEE Computer, Vol. 27, No. 9, September 1994, pp 42-48.

[Stroustrup94] Stroustrup B, The C++ Programming Language (Second Edition), Addison Wesley, 1994.

[Taub61] Taub A H, John von Neumann: Collected Works. Volume V: Design of Computers, Theory of Automata and Numerical Analysis, Pergamon Press, Oxford, 1961.

[Valiant90] Valiant L, A Bridging Model for Parallel Computation, Communications of the ACM, Vol. 33, No. 8, August 1990, pp 103-111.

[WWBSPHome] BSP Worldwide World Wide Web Home Page, http://www.bsp-worldwide.org/.

[Zhang94] Zhang X and Deng H, Distributed Image Edge Detection Methods and Performance, Technical Report TR-94-06-02, High-Performance Computing and Software Laboratory, The University of Texas at San Antonio, San Antonio, Texas, 1994.