1 Introduction
1.1 Context
The National Laboratory for High Performance Computing (NLHPC) project
aims to install in Chile a supercomputing infrastructure that meets the
domestic scientific demand for high performance computing (HPC), offering
high quality services and promoting their use in problems of both basic and
applied research, as well as in industrial applications. In recent years the
development of applied science and industry has been led by the sophisticated
use of information and communication technologies (ICT), a process in which
HPC has played an important role. In Chile, some areas of science as well as
some industrial sectors have reached a level of maturity that, to maintain
their global competitiveness, requires the use of HPC-related technologies.
Identifying the opportunities that the availability of this technology will
bring to the country, most of the research universities of Chile, led by the
Center for Mathematical Modeling (CMM) of the University of Chile (UChile),
proposed to CONICYT the creation of the National Laboratory for High
Performance Computing (NLHPC). The NLHPC is created by the University of
Chile as Sponsoring Institution and, as Associated Institutions (AI), the
universities Pontificia Universidad Católica de Chile (PUC), Universidad
Técnica Federico Santa María (UTFSM), Universidad de Santiago (USACH),
Universidad de la Frontera (UFRO), Universidad de Talca (UTalca) and
Universidad Católica del Norte (UCN), in association with REUNA. The NLHPC
supercomputing infrastructure will be composed of several HPC clusters
distributed among the members of this laboratory. All these clusters will be
connected via high speed networks provided by REUNA. The central processing
node is hosted at the Center for Mathematical Modeling (CMM), Faculty of
Physical Sciences and Mathematics (FCFM) of the UChile, a center of
excellence in scientific research with extensive experience in managing large
collaborative projects. This manual describes the current HPC infrastructure
of the CMM, called "Levque". The name comes from the word for thunder in
Mapudungun, the language of the Mapuche, the main native nation of Chile.
2 Hardware
2.1 Cluster Architecture
The cluster architecture can be divided into three main areas: the Computing
Area, the Storage Area and the Administration Area. These areas are inter-
connected by means of two separate networks, each one playing a different role
within the HPC infrastructure. The computing area is used to perform the
scientific computations; it is composed of several compute nodes, each one
equipped with multicore CPUs, where users run their applications. The storage
area provides a scalable persistent layer for the data required by the comput-
ing area. The administration area is used to facilitate the interaction with the
computing area and to monitor the whole infrastructure as well as the
correctness of the users' jobs. These areas are bonded through the
interconnection network and the administration network. The former is used
for computing purposes (I/O and IPC, for example) and the latter is used to
operate, maintain and monitor the HPC infrastructure.
In particular, the Levque cluster architecture comprises a computing area of
67 nodes, representing 536 cores exclusively dedicated to running users'
jobs. The storage area is composed of 5 nodes: four dedicated only to I/O
operations, totaling 8 TB of available space, which is managed by one server
known as the meta-data server. Finally, the administration area is composed
of four nodes: two acting as the head of the cluster (two master nodes in
fail-over configuration) and two acting as the interface of the cluster (one
for users and one for grid computing). The interconnection network is a
packet-switched network based on the Infiniband (IB) technology, capable of
reaching a throughput of 40 Gb/s per port with very low end-to-end latency.
Each node in the cluster is equipped with an IB Host Channel Adapter (HCA)
with two ports, both connected to a switch capable of growing up to 432
Infiniband ports (by adding leaf modules). The administration network is
composed of 5 Ethernet switches providing a link rate of up to 4 Gb/s
(4 x 1 Gb/s), from which users can log into the cluster to run their jobs and
recover their results. Figure 1a depicts how the interconnection network
bonds the above defined areas and Figure 1b depicts the administration
network layout.

Figure 1: (a) Interconnection Network; (b) Ethernet Network
In the following, we introduce each area in detail, covering the more
technical aspects of the architecture described above.
2.4 Administration Area
The administration area is composed by two parts: the administration nodes
and the administration network. The first one consists in two master nodes,
one active and one passive in fail over configuration, and a frontend node, from
which users can interact with the cluster. Both nodes are equipped with two
quad core Intel Xeon E5540 processors @ 2.67 Ghz with 28 GB RAM. The front
end node is reachable through the hostname development.dim.uchile.cl. The
second part is composed by 5 Giga-Ethernet switches, each of of 48 ports each
one. They are interconnected in such a way to provide a redundant connectivity
with the computing and storage area. Also the monitoring tasks are performed
over this network, which are aggregated by the master node of the cluster. It is
important to remark that also the master node is monitoring the power supply
facility which is providing up-to 60 KVA of electrical power to keep the whole
HPC infrastructure always up and running.
2.5 Networking
The Levque cluster uses two interconnection networks: a high speed network
called Infiniband and an Ethernet network. The former is used to interconnect
the compute nodes for calculation and I/O purposes, and the latter is used for
administration and user interaction.
The Infiniband network used by Levque is connected by means of a QDR 100%
non-blocking fabric switch providing a high bandwidth of 40 Gb/s per port and
an end-to-end latency of approximately 100 ns. Through this network, the
compute nodes can use a message passing communication library (MPI) as well
as IP (layer 3) services. In addition, the Levque IB network provides the
Remote Direct Memory Access (RDMA) method, which gives users the ability to
communicate different processes in a memory-to-memory access scheme (almost
like a shared memory system).

The Ethernet network is a 1 Gb/s network and exhibits latencies of about
1 ms at layer 3, mostly generated by the OS network stack. This network is
used to interact with the cluster through a frontend (login) node, which is
accessed via the secure shell communication protocol (SSH). This interaction
includes command shell and file operations at the moment (no visualization
services are provided).

Figure 3: QLogic 12800-180 Infiniband Switch
3 Software
The software loaded into the Levque cluster can be divided in two categories: 1)
administrative software and 2) scientific software. The administrative software
is divided in three areas: a) base software, b) development tools and c) libraries.
The scientific software is divided in two areas: d) licensed software and e) open-
source software. The licensed software has a restricted use due to the license
agreement. For further details on how to use your own licensed software, please
contact the NLHPC personnel. The opensource software available at Levque
can be used by anyone.
In the following, each area is listed with the version of the software and their
respective environment where they are available. The environment settings are
detailed in the Section 4.6. It is worth to mention that, due to space reasons,
only the most important ones are listed.
• Development Tools
• Libraries
• OpenSource
Name                     Version    Modulefile or HomePath
GNU compilers            4.1.2      /usr/bin
                         4.4.0      /usr/bin
Intel compilers          11.1.072   /opt/intel/Compiler/11.1/072
PGI compilers            10.9       /opt/pgi/linux86-64/10.9
OpenMPI with GNU 4.1.2   1.4.2      openmpi/1.4.2
                         1.4.3      openmpi/1.4.3
OpenMPI with GNU 4.4.0   1.4.3      openmpi_gcc44/1.4.3
OpenMPI with Intel 11.1  1.4.3      openmpi_intel/1.4.3
OpenMPI with PGI 10.9    1.4.3      openmpi_pgi/1.4.3
Valgrind                 3.5.0      /usr/bin
Boost                    1.33.1     /usr/bin
Java                     1.6.0      /usr/bin
Python                   2.4.3      /usr/bin
                         2.6.6      python/2.6.6
Perl                     5.8.8      /usr/bin
Name                 Version   Modulefile or HomePath
Cplex                12.2.0    cplex/12.2.0
                     12.1.0    cplex/12.1.0
                     9.1.3     cplex/9.1.3
Stata                11.0      stata/11.0
Matlab               7         /opt/matlab7
Gaussian             09B01     gaussian/09B01
Fluent               12.0      fluent/12.0
                     12.1      fluent/12.1
Comsol Multiphysics  4         comsol/4.0
Knitro               6.0       /opt/knitro/knitro-6.0.0-student
                     7.0       /opt/knitro/knitro-7.0.0-z
laying on any graphical interface. At present there is only one way to access
the Levque Cluster, although in the near future there will be three ways to
access it. Therefore, at the moment (March, 2011) access is made via the
frontend node development.dim.uchile.cl, and to access this node it is
mandatory to have an account on the Levque Cluster. The login operation is
made through the SSH protocol; therefore, it is necessary to have a client
application supporting this protocol. There are several such applications for
different operating systems. For Linux, the ssh client is part of the OpenSSH
package. For Windows, an SSH client can be downloaded from
http://www.putty.org/. For Mac OS, the page http://openssh.com/macos.html
describes several alternatives.
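As a concrete sketch, a login from a local terminal looks as follows. The
account name jdoe is hypothetical; the hostname is the frontend given above,
and the snippet only prints the command rather than opening a connection:

```shell
# Hypothetical account name; replace it with the account assigned
# by the NLHPC personnel.
USER_NAME="jdoe"
# Frontend hostname as given in this manual.
FRONTEND="development.dim.uchile.cl"
# Print the login command a user would run from a local terminal.
echo "ssh ${USER_NAME}@${FRONTEND}"
```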
Once logged onto the Levque cluster (the frontend node), the user interacts
with the operating system as usual in any *NIX operating system. The user
prepares the execution of their scientific application by means of a script
which is submitted to a workload system; this workload system schedules the
execution as a job. It is important to highlight that it is strictly
forbidden to run scientific applications for long periods on the
frontend node. By a long period we mean more than one minute. For testing
purposes, users may run their applications for short periods to ensure their
correct execution.
Figure 4: User interaction diagram
the stdout and the stderr are consolidated in a single file (detailed later
on). This output file is named after the job, with an extension composed of
the letter "o" or "e" plus the Job ID assigned by the workload system. For
instance, for a job called "test" with Job ID 33530, the output files will be
"test.o33530" and "test.e33530", assuming stdout and stderr are not joined.
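The naming scheme can be sketched in a few lines of shell, using the job
name and ID from the example above:

```shell
# Job name and Job ID taken from the example in the text.
JOB_NAME="test"
JOB_ID=33530
# File receiving stdout and file receiving stderr (when not joined).
STDOUT_FILE="${JOB_NAME}.o${JOB_ID}"
STDERR_FILE="${JOB_NAME}.e${JOB_ID}"
echo "${STDOUT_FILE} ${STDERR_FILE}"
```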
In general, the steps to execute an application in the Levque Cluster are the
following:
All the technical aspects related to job submission, scheduling and execution
on the Levque Cluster are controlled by the Oracle Grid Engine (OGE)
software [6] (formerly known as Sun Grid Engine, SGE). The OGE limits the
total number of jobs/slots each user may run simultaneously on the cluster.
However, a user may submit as many jobs as needed, with the constraint that
they will be queued, waiting for available resources to be executed. The
policies to manage the sharing of resources among users are defined by the
Scientific Strategic Committee (SSC) of the NLHPC and implemented in the OGE.
Notice that the HPC resources (number of cores, wall/user time, storage
space, RAM, etc.) are assigned to users according to a fair-use policy. In
the near future this policy will be replaced by open calls where users can
request HPC resources. Through these calls, twice a year, users requesting
HPC resources will submit their research proposals to the NLHPC. An external
committee will review the merit of the proposals, ranking them to be
presented to the SSC. Further details on this Call for Proposals will be
available on the NLHPC website. Also, the SSC and/or the NLHPC Executive
Committee may assign different priorities to users according to the strategic
development plan of the NLHPC. Hence, the workload system will not be a
simple FIFO (First In, First Out) queue; instead, the scheduler will use a
formula to calculate each job's position in the queue.
3  #$ -j n
4  #$ -notify
5  #$ -M claudio.baeza.r@gmail.com
6  #$ -m abes
7  #$ -N test
8  #$ -S /bin/csh
9  #$ -q all.q
10 source /etc/profile.d/modules.csh
11 module load pgi
12 ./myprogram
• Line 1: defines the command shell interpreter that is used to execute the
script when running it outside the OGE environment.
• Line 2: use the current directory as the working directory.
• Line 3: "n" for no and "y" for yes; join or do not join the output and
error messages into a single file.
• Line 4: Notification before the end of the job. The OGE will send a
SIGUSR2 signal to your application 60 seconds before the time specified by
the -l h_rt option, notifying it before killing it so that it can perform
some cleanup and save results before losing everything. Your application
must intercept this signal; otherwise the -notify option has no effect.
• Line 5: Defines an email where any notification about the job state will
be sent.
• Line 6: This option is used together with the previous option and defines
the event by which the email will be sent.
– b: at the beginning of the job.
– e: at the end of the job.
– a: when the job is aborted or rescheduled.
– s: when the job is suspended.
• Line 7: This option defines the name of the job. It is used when displaying
the list of running jobs.
• Line 8: Specifies the command interpreter (shell) that will be used to
execute the job inside the OGE environment.
• Line 9: Indicates the queue in which the job will be executed.
• Line 10: Here the execution script begins. The language used must agree
with the shell interpreter specified at line 8.
There are many ways to configure a job script file, but there are three
classic job types implemented by the OGE: the standard job, the parallel job
and the job array. The first is meant to execute a single task on a single
CPU (with the memory limitations of a single node). The parallel job and the
job array define more complex scenarios.
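To make the first type concrete, a minimal standard job script could look as
follows. This is a sketch, not an official NLHPC template: the #$ lines are
OGE directives read at submission time (the shell treats them as comments),
the job name standard_test is hypothetical, and the payload is just an echo.

```shell
#!/bin/bash
#$ -cwd               # use the current directory as working directory
#$ -N standard_test   # job name shown in queue listings (hypothetical)
#$ -S /bin/bash       # shell used to interpret the job
#$ -q all.q           # target queue
# Payload of the standard job: a single serial task on one CPU.
echo "standard job running on host: $(hostname)"
```

Submitted with qsub, it would produce output files named
standard_test.oJOBID and standard_test.eJOBID, following the naming scheme
described earlier.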
To run a parallel application, the user must use a parallel environment from
the list above. This parallel environment is defined in the job script, as
depicted in the following example:
1  #!/bin/csh
2  #$ -cwd
3  #$ -j n
4  #$ -notify
5  #$ -M claudio.baeza.r@gmail.com
6  #$ -m abes
7  #$ -N test
8  #$ -S /bin/csh
9  #$ -pe openmpi 64
10 #$ -q all.q
11 source /etc/profile.d/modules.csh
12 module load openmpi_intel
13 mpirun myparallelprogram
Notice line 9, where the directive "-pe" is used to request the execution of
"myparallelprogram" under the "openmpi" environment with 64 cores. The
allocation policy of processes within an environment is defined by the OGE.
5 myprogram 134.4 80-out.dat
Notice that the command returns the Job ID assigned to the execution of the
test.sh script (in this case, 33530). For further information, use the
command man qsub.
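In scripts it is often useful to capture that Job ID programmatically.
Assuming the usual SGE/OGE confirmation message format (hard-coded below as
a sample string, since no scheduler is invoked here), the ID is the third
whitespace-separated field:

```shell
# Sample qsub confirmation line (hard-coded for illustration).
MSG='Your job 33530 ("test") has been submitted'
# The Job ID is the third field of the message.
JOB_ID=$(echo "$MSG" | awk '{print $3}')
echo "$JOB_ID"
```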
The example shown above indicates the load on each queue (CQLOAD), the cores
in use (USED), the cores in reserved state (RES), the cores available in the
queue (AVAIL) and the total number of cores (TOTAL). The two following
columns give the number of jobs that are in the states:
• a - Load threshold alarm
• o - Orphaned
• A - Suspend threshold alarm
• C - Suspended by calendar
• D - Disabled by calendar
• c - Configuration ambiguous
• d - Disabled
• s - Suspended
• u - Unknown
• E - Error
Other useful commands to track the state of the queues and jobs are:
In addition, to manage jobs in the workload system, there exist several
commands, of which we mention only the most important ones according to our
criteria.
4.6 Working Environments: the module utility
The module utility is a user interface that provides dynamic modification of
the user's environment via the command line, for instance when handling
several versions of the same application or library. Each modulefile contains
the information needed to configure the environment for an application. Once
the module package is initialized, the environment can be modified on a
per-module basis using the module command, which interprets modulefiles.
Typically, modulefiles instruct the module command to alter or set shell
environment variables such as PATH, MANPATH, etc. For example, to see which
modules are available or loaded, issue module avail or module list.
[claudio@development ~]$ module list
Currently Loaded Modulefiles:
1) openmpi/1.4.3
[claudio@development ~]$ module avail
To load a specific environment, the command to issue is module load
modulefile, where modulefile is the name of the module (environment, to be
precise) to load; unloading a module is analogous (module unload modulefile).
[claudio@development ~]$ module load openmpi_intel/1.4.3
[claudio@development ~]$ module list
Currently Loaded Modulefiles:
1) openmpi_intel/1.4.3
[claudio@development ~]$ which mpicc
/opt/openmpi_intel/1.4.3/bin/mpicc
[claudio@development ~]$
[claudio@development ~]$ module list
Currently Loaded Modulefiles:
1) openmpi_gcc44/1.4.3 2) cplex/9.1.0
[claudio@development ~]$ module unload cplex
[claudio@development ~]$ module list
Currently Loaded Modulefiles:
1) openmpi_gcc44/1.4.3
[claudio@development ~]$
(Makefiles [3] for example) and bash scripting [1]. At the end of this
document, several documents are referenced as recommended readings to improve
the user's understanding of these examples.
 * Universidad de Chile
 * March, 2011
 */

#include <iostream>
#include <cstdlib>   // for atof()
#include <time.h>
#include "BisectionMethod.h"

using namespace std;

int main(int argc, const char* argv[]) {

    // check the arguments provided by the command line
    if (argc != 4) {
        cout << "usage: " << argv[0] << " a_0 b_0 epsilon" << endl;
        return 9; // we associate 9 to the invalid-arguments error
    }

    double a_0     = atof(argv[1]);
    double b_0     = atof(argv[2]);
    double epsilon = atof(argv[3]);

    if (epsilon <= 0) {
        cout << "epsilon must be greater than 0" << endl;
        return 9;
    }

    cout << "find the root of f(x) = x^2 - 1 by the bisection method with epsilon = "
         << epsilon << endl;

    // strategy:
    // 1. we instantiate the solver class (BisectionMethod). If the
    //    initial interval is invalid, we catch the corresponding
    //    exception to report the error.
    // 2. we start the iteration method. If it does not converge after
    //    MAX_ITERATIONS (defined by the solver class), we catch the
    //    DivergenceException.
    // 3. if we do not get any exception, the method has found a solution
    //    iff no other exception is triggered. We capture those to know
    //    the process has terminated abnormally.
    // 4. we query the solver: if we got a solution, we show it;
    //    otherwise, we report that there is no solution.

    // we measure the execution time
    clock_t start, end;
    start = clock();

    try {
        // instantiate the bisection method
        BisectionMethod bm(a_0, b_0, epsilon);

        // start the computation
        cout << "computing..." << endl;
        bm.start();

        // get the ending clock
        end = clock();

        // print the execution time
        double exec_time = (double)(end - start) / CLOCKS_PER_SEC;
        cout << "Execution time: " << exec_time << " s" << endl;

        // check whether a root has been found or not
        if (bm.existRoot()) {
            // show the results
            cout << "Root found: f(x) = 0 with x = " << bm.getRoot()
                 << " in [" << a_0 << "," << b_0 << "]" << endl;
        } else {
            cout << "No solution found" << endl;
        }
    } catch (InvalidIntervalException& ex) {
        // this code is executed when an InvalidIntervalException has been
        // thrown. the what() method tells us the reason of the exception
        cout << "Invalid initial interval, exiting..." << endl;

        // we return 1, associating it to the invalid-interval error
        return 1;
    } catch (DivergenceException& ex) {
        // this code is executed when a DivergenceException has been
        // thrown. the what() method tells us the reason of the exception
        cout << "the method does not converge" << endl;

        // we return 2, associating it to the divergence problem
        return 2;
    } catch (exception& ex) {
        // this code is executed when any other exception has been thrown.
        // the what() method tells us the reason of the exception
        cout << "ERROR: we caught an exception. reason: " << ex.what() << endl;

        // we return 3, associating it to any other error reason
        return 3;
    }

    // return 0 to inform the interpreter that the application has ended
    // normally. any value bigger than 0 is considered an abnormal
    // termination
    return 0;
}

// Note that we do not need to release the memory used by the bm object,
// since it was not created by a new command; therefore, the existence of
// the object is bound to the scope of the function (method) where it was
// instantiated. In other words, when the main routine finishes, the
// memory will automatically be released. Attention: it is not the same
// when the object is instantiated by means of a new command.
//
// Enjoy!
// JcM
The first consideration about this code is that we use exceptions to catch
the problems the method may encounter when running. In object oriented
programming, an exception is defined as any condition that may produce an
error in the execution of the code. These exceptions can be "caught" and
handled by the user (who is programming the application) in order to avoid
the error in the execution. In this way, all the code inside the "try" clause
is "catchable", and by polymorphism we identify the kind of exception that
was triggered by the methods used within the "try" scope. In particular, in
the bisection method there are two important problems that need to be caught:
1) the algorithm does not converge, and 2) the initial parameters are
incorrect. Thus, the implementation of the bisection method should evaluate
these conditions and throw the appropriate exceptions, allowing the
programmer to handle these situations. For that, we define a divergence
exception thrown when the algorithm iterates more than 1e9 times.
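That threshold is very generous. As a rough sanity check, under the classical
interval-halving criterion (which stops when the interval width drops below
the tolerance, rather than the |f(x)| < epsilon test used in this
implementation), the iteration count grows only logarithmically with the
interval size; the values below are illustrative:

```shell
# Roughly log2((b0 - a0) / eps), rounded up: the number of halvings
# needed to shrink [a0, b0] below the tolerance eps.
a0=0
b0=100
eps=0.0001
awk -v a="$a0" -v b="$b0" -v e="$eps" \
    'BEGIN { printf "%d\n", int(log((b - a) / e) / log(2)) + 1 }'
```

For these values the bound is about 20 iterations, many orders of magnitude
below the 1e9 divergence threshold.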
The second consideration is about the design of the BisectionMethod class.
Its interface consists of three methods: start(), existRoot() and getRoot().
The first begins the iteration process with the parameters given at the
construction of the object (the constructor receives the parameters to run).
The second returns whether a solution has been found or not, and the third
returns the solution found. For simplicity, we do not handle the case where
the programmer invokes this method and no solution has been found; this case
can be easily handled by defining a new exception.
In summary, we identified two main objects to be implemented: the
BisectionMethod and the exceptions. The BisectionMethod acts as a solver, and
the exceptions help the user handle the possible problems that may occur with
the solver. In Listing 2, the interface of the BisectionMethod class is
defined.
Listing 2: Interface of the bisection method class (BisectionMethod.h)
#ifndef BISECTIONMETHOD
#define BISECTIONMETHOD

/*
 * Interface of the BisectionMethod class
 *
 * Juan-Carlos Maureira
 * NLHPC - CMM
 * Universidad de Chile
 * March, 2011
 */

#include <math.h>
#include "BisectionMethodExceptions.h"

#define MAX_ITERATIONS 1e9

class BisectionMethod {
private:
    double epsilon;
    double a_0;
    double b_0;

    double root;
    bool root_found;

public:
    // constructor
    BisectionMethod(double a_0, double b_0, double epsilon);

    // no need of a destructor (no objects created within an
    // instance of this class)

    /* start the method */
    void start();

    /* return whether a root has been found or not */
    bool existRoot();

    /* get the root when it exists; otherwise it throws an exception */
    double getRoot();
};

#endif
    while (k < MAX_ITERATIONS) {

        x_k = (a_k + b_k) / 2.0;
        if (fabs(f(x_k)) < this->epsilon) {
            // solution found
            this->root = x_k;
            this->root_found = true;
            return;
        }
        if (f(x_k) * f(b_k) < 0) {
            a_k = x_k;
        } else {
            b_k = x_k;
        }
        k++; // count the iteration
    }

    // MAX_ITERATIONS reached: the method does not converge
    throw DivergenceException();
}

/* return whether a root has been found or not */
bool BisectionMethod::existRoot() {
    return this->root_found;
}

/* get the root when it exists; otherwise it throws an exception */
double BisectionMethod::getRoot() {
    return this->root;
}
The exception classes defined for this application inherit from the C++
exception model. This design decision is considered a best practice when
using exceptions in C++; however, users may implement their own exception
model. Normally, the what() method should return the reason for the
exception, or any important value required to handle it in the "catch" scope.
Notice that we use polymorphism to identify the type of exception: the
"catch" instruction receives a reference to a plain exception object, and the
"catch" argument discriminates the object by its concrete realization
(instantiation). Listing 4 presents the implementation of these classes. We
join the interface with the implementation without loss of generality.
Listing 4: Implementation of the exception classes (BisectionMethodExceptions.h)
#ifndef BISECTIONMETHODEXCEPTIONS
#define BISECTIONMETHODEXCEPTIONS

/*
 * Bisection Method Exceptions set
 * Inherited from the C++ exception class
 *
 * More info about C++ exceptions:
 * http://www.cplusplus.com/doc/tutorial/exceptions/
 *
 * Juan-Carlos Maureira
 * NLHPC - CMM
 * Universidad de Chile
 * March, 2011
 */

#include <iostream>
#include <exception>

class InvalidIntervalException : public std::exception {
public:
    virtual const char* what() const throw() {
        return "Invalid initial interval";
    }
};

class DivergenceException : public std::exception {
public:
    virtual const char* what() const throw() {
        return "The method does not converge";
    }
};

#endif
Once these files are written on the Levque Cluster, the user only needs to
issue the command "make" to start the compilation process. We highlight that
the internals of a compilation are complex and compilation can fail for
multiple reasons; users must acquire experience compiling code in order to
deal with compilation errors.
After a successful compilation, the generated target binary is called
"bisection". As already mentioned, this binary receives three arguments: a0,
b0 (the interval to explore) and the tolerance ε. To run this application, we
must create an execution script. This script must define the job properties
as well as the instructions to prepare the required input parameters and
handle the output of the execution. For that, we create a job array execution
script according to Section 4.3.2. We define a job array from 1 to 500 with
steps of 1 to compute the tolerance ε = 1/(100·SGE_TASK_ID). Then we execute
the application redirecting the output (stdout) to a temporary file in order
to parse it afterwards and extract the solution found. An important remark
when using job arrays and temporary files: the files must not share the same
name; otherwise, results will be overwritten by the sibling jobs within the
array. Therefore, we use a random temp file to avoid this problem. When using
this trick, it is important to remove the temp file before exiting the
script, to avoid polluting the computing area with temp files. Listing 6
presents the execution job script for the bisection application.
Listing 6: Job script for the bisection method application (bisection_job.sh)
#!/bin/bash
# SGE Job Array definition
#$ -cwd
#$ -j n
#$ -notify
#$ -M jcm@dim.uchile.cl
#$ -m abes
#$ -N bisection
#$ -S /bin/bash
#$ -q all.q
#$ -t 1-500:1

# bisection method study script
SOLVER=./bisection

# initial conditions
a_0=0
b_0=100

# compute the epsilon to evaluate
epsilon=`echo "scale=10; 1/(100*$SGE_TASK_ID)" | bc`

# create a temp file (in /tmp) to grab the solver output
OUTPUT=`mktemp`

# evaluate the epsilon
${SOLVER} $a_0 $b_0 $epsilon > $OUTPUT

# parse the output to get the computed root
# the root is in the last line, 9th token
X=`cat $OUTPUT | tail -n 1 | awk '{print $9}'`

# print the results
echo $epsilon $X

# please, remember to remove the temp output file
rm $OUTPUT

#EOF
The previous execution script writes the output of the application to stdout
(see the cout instructions in main.cpp). Since the workload system redirects
the output of the job to an output file, for this example the output file is
called bisection.o plus the Job ID. As this job is a job array, each
execution within the array is identified by appending the task number to the
extension of the output file. In this way, when submitting our job script, we
should obtain 500 output files called "bisection.oPID.x", where PID is the
Job ID and x ∈ [1, 500].
As we parse the output of the binary application in order to print the infor-
mation we need, each output file will contain a single line with the ε value and
the solution for f (x) = 0. Therefore, our post-processing script should gather
all these output files, build a table ordered by ε value, and create a plot of ε
versus the numerical solution for x. For that, we rely on the gnuplot
application. As the Levque cluster does not provide visualization services, the
user should generate all the graphical results in the form of output files. Thus,
we create a gnuplot script that gathers the result files and writes the plot in an
Encapsulated PostScript (EPS) file. Listing 7 shows this post-processing script.
Notice that we also plot the theoretical solution x = 1 to compare the results.
Figure 5: Numerical solution for f (x) = 0 using the bisection method for several
tolerance values

the join of the processes is easier to implement. But when solving this problem
assuming that each child process is running on a different CPU (or even on
different hosts), the use of MPI is required. There are many ways to implement
the fork and join problem using MPI. Here, we show one of the simplest ones.
To give a glimpse of MPI: an application is executed on different CPUs (or
nodes) at the same time. Conversely to forks or threads, the MPI processes are
detached from the beginning (there is no parent process or explicit fork). Each
process is identified by a process ID, or PID (in MPI terms, its rank), and the
MPI library provides methods to communicate these processes in different ways.
The communication between processes is classified into blocking and non-blocking
communication, meaning that the execution of a receiving process is blocked, or
not, by the receiving instruction. MPI also provides many communication
paradigms, such as unicast (one to one), broadcast (one to many), and collective
operations (many to one or many to many), among others.
For this example, we use a blocking unicast communication to wait for the
other processes to finish. In other words, we assign the role of master to a
single process, and the rest are considered slaves. As the parallel application
begins its execution in a detached state, the master process will wait for each
slave process to communicate its ending, and each slave process will perform
some calculation and then report its result before ending. So, for n processes,
the master should receive n − 1 communications before continuing its execution.
Implementing this example in C++, we propose an object-oriented design
based on an abstract MPI process. As this class cannot be instantiated
(because it is abstract), we use the Factory pattern [2] to create instances of an
MPI process. This factory delivers two kinds of MPI process: the master and
the slave process. The Factory object has the responsibility of checking the
PID of each process and delivering the correct MPI process instance (master for
PID 0 and slave for PID > 0). The main execution routine only commands the
MPIProcess instance to begin its execution, and by polymorphism the applica-
tion runs the appropriate run method. Therefore, we have two derived classes:
the MPIMasterProcess and the MPISlaveProcess. We use this trick to avoid
implementing the logic of the application in the main execution routine. Re-
member that the main routine is executed on each CPU (or host); this way we
have cleaner, more understandable code instead of a large main routine
implementing everything. Listing 8 shows this main routine.
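Listing 8 itself is not reproduced in this excerpt. Based on the description above (initialize MPI, obtain an MPIProcess from the factory, call run(), finalize), the main routine would look roughly as follows; this is a sketch consistent with the surrounding listings, not the author's verbatim code:

```cpp
#include "mpi.h"
#include "MPIObjectFactory.h"

int main(int argc, char *argv[]) {
    // Every process, on every CPU or host, executes this same routine.
    MPI::Init(argc, argv);

    // The factory inspects the rank: master for rank 0, slave otherwise.
    MPIProcess *process = MPIObjectFactory::getMPIProcess();

    // Polymorphism dispatches to MPIMasterProcess::run or MPISlaveProcess::run.
    process->run();

    delete process;
    MPI::Finalize();
    return 0;
}
```

Compiling and running this sketch requires an MPI installation and the project headers shown in the later listings.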
Notice that each process must initialize the MPI environment (MPI::Init)
and finalize it when exiting (MPI::Finalize). The code running on each CPU
or host is the same: get an MPIProcess instance and call its run method. The
MPIObjectFactory::getMPIProcess() method returns the appropriate instance
of MPIProcess according to the rank (PID) of each process (master for rank
0 and slave for rank > 0). As both derived classes inherit from MPIProcess,
polymorphism will do “the magic” of invoking the correct implementation
of the run method. Before discussing the implementation of each MPIProcess,
Listing 9 shows the Factory pattern implemented for the MPIProcess abstract
class.
Listing 9: The MPI Object Factory (MPIObjectFactory.h)

#ifndef MPIOBJECTFACTORY
#define MPIOBJECTFACTORY

#include "mpi.h"
#include "MPIMasterProcess.h"
#include "MPISlaveProcess.h"

class MPIObjectFactory {
private:
    static MPIProcess *createMasterProcess() {
        return new MPIMasterProcess();
    }
    static MPIProcess *createSlaveProcess() {
        return new MPISlaveProcess();
    }
public:
    static MPIProcess *getMPIProcess() {
        int rank = MPI::COMM_WORLD.Get_rank();
        if (rank == 0) {
            return createMasterProcess();
        }
        return createSlaveProcess();
    }
};

#endif
Notice that the method getMPIProcess() has the intelligence to return the
appropriate MPIProcess instance according to the rank of the MPI process.
Another important remark is that this factory object is static. More details
about why it must be static, and other examples, can be found in [2].
Now, we define the interface of the MPIProcess abstract class. This class
is considered abstract since it has a virtual method that is not linked to an
implementation (the “=0” in line 18). Notice also that this class provides the
MPI functionality to get the rank of the process, send or receive data, etc.
All the implementation common to the master and slave processes must go
in this class. Otherwise, the user would have to implement each piece
separately in each derived class, increasing the maintenance cost of the code.
It is always better to factorize the common methods in an ancestor class, in
this case MPIProcess. As the run() method is abstract, the user is forced
to implement it when deriving from MPIProcess; otherwise, the compiler will
reject any attempt to instantiate the derived class. Listings 10 and 11 present
the interface definition and the implementation of the MPIProcess class.
Listing 10: Interface for the MPI Process base class (MPIProcess.h)

1  #ifndef MPIPROCESS
2  #define MPIPROCESS
3
4  class MPIProcess {
5  protected:
6      int rank;
7  public:
8      MPIProcess();
9      ~MPIProcess();
10
11     int getRank();
12     int getProcessNumber();
13
14     void send(int dst, int data);
15
16     void recv(int data);
17
18     virtual void run() = 0;
19
20 };
21
22 #endif
Listing 11: Implementation for the MPI Process base class (MPIProcess.cc)

#include "mpi.h"
#include "MPIProcess.h"

MPIProcess::MPIProcess() {
    this->rank = MPI::COMM_WORLD.Get_rank();
}

MPIProcess::~MPIProcess() {

}

int MPIProcess::getRank() {
    return this->rank;
}

int MPIProcess::getProcessNumber() {
    return MPI::COMM_WORLD.Get_size();
}

void MPIProcess::send(int dst, int data) {
    MPI::COMM_WORLD.Send(&data, 1, MPI::INT, dst, 0);
}

// TODO: use templates to receive all the primitive types of data
// (note: data is received into a local copy; the master only counts arrivals)
void MPIProcess::recv(int data) {
    MPI::COMM_WORLD.Recv(&data, 1, MPI::INT, MPI::ANY_SOURCE, 0);
}
Now, the foundations of our MPI application are settled. From here, the
user can implement any MPI process by deriving from this base object model. As
this example aims at solving the fork and join problem, we implement a master
and a slave process inherited from this object model. Thus, the master process
waits for the n − 1 slave processes to finish and then returns the execution to
the main thread (the main.cpp in rank 0). Listings 12 and 13 present the
MPIMasterProcess class interface and implementation. Notice that the only
method implemented is run(), since all the rest is inherited from the MPIProcess
base class.
Listing 12: Interface for the MPI Master Process (MPIMasterProcess.h)

#ifndef MPIMASTERPROCESS
#define MPIMASTERPROCESS

#include "MPIProcess.h"

class MPIMasterProcess : public MPIProcess {
public:
    MPIMasterProcess();
    virtual void run();
};

#endif
The logic of the master process is implemented in the run() method. Notice
that it waits, by means of the blocking method recv(), for the arrival of the
response of a slave process, and then adds 1 to the overall count of ended
processes (the count variable). We say “blocking” since the implementation of
the recv() method uses the Recv method of MPI, which is blocking by default.
When the ended-process count reaches the total number of processes, not
counting the master, the master's run() method ends.
For the MPISlaveProcess, we follow the same idea, but we implement the
run() method according to the slave's role. Listings 14 and 15 show the
interface and the implementation of the MPISlaveProcess class.
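Listing 14 is likewise absent from this excerpt; by symmetry with the master's interface, it would presumably read as follows (a reconstruction, not the author's verbatim code):

```cpp
// MPISlaveProcess.h -- reconstructed by symmetry with MPIMasterProcess.h
#ifndef MPISLAVEPROCESS
#define MPISLAVEPROCESS

#include "MPIProcess.h"

class MPISlaveProcess : public MPIProcess {
public:
    MPISlaveProcess();
    virtual void run();
};

#endif
```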
Listing 15: Implementation for the MPI Slave Process (MPISlaveProcess.cc)

#include <cstdlib>
#include <iostream>
#include <ctime>
#include <unistd.h>
#include "MPISlaveProcess.h"

MPISlaveProcess::MPISlaveProcess() {
    std::cout << "MPISlaveProcess Constructor" << std::endl;
}

void MPISlaveProcess::run() {
    std::cout << "run slave process" << std::endl;

    // do something
    srand(this->getRank());

    int wt = (rand() % 10) + 1;
    std::cout << "process " << this->getRank() << " waiting for " << wt
              << " seconds" << std::endl;
    sleep(wt);

    std::cout << "process " << this->getRank() << " notifying master and "
              << "exiting" << std::endl;

    this->send(0, 1);

    // clean everything and return
    return;
}
For this example, the run() method waits a random time before notifying
the master of the ending of the process and returning. Note that we seed the
random number generator with the rank of the process in order to get a different
stream of random numbers within each process; otherwise, all of them would
generate the same stream of random numbers, which is not useful for our purpose.
Also note that the method uses send() to inform the master (rank 0) that the
slave process has ended. For that, we send the number 1 to the master. Why 1?
Indeed, it does not matter; it can be any integer. For sending a float or any
other primitive type of data, the MPIProcess class must implement send and
receive methods for those types; only the integer one is implemented.
Now the code is complete and we need to compile it. We use a Makefile to
perform this task in a similar way to the previous example. However, we use
the mpic++ compiler wrapper to avoid adding the MPI library and include
paths by hand. This wrapper, commonly provided by MPI implementations, may
differ in name, so the user must be careful to invoke the correct compilation
wrapper. For this example, we use the OpenMPI implementation, so we load
the corresponding module before compiling the code, as mentioned in Section
4.6. Listing 16 depicts how this Makefile looks.
Listing 16: Makefile for the MPI example

CXX=mpic++
TARGET=main
OBJECTS=main.o MPIProcess.o MPIMasterProcess.o MPISlaveProcess.o

all: main

%.o: %.cc
	$(CXX) -c $<

.cpp.o:
	${CXX} -c $<

main: $(OBJECTS)
	$(CXX) *.o -o $(TARGET)

clean:
	rm *.o
	rm $(TARGET)
Notice the banner text is shown 5 times, indicating the main routine is
executed on each one of the 5 requested CPUs. After each MPI process is
created, our application assigns the master role to the process with rank 0 and
the slave role to the rest (rank > 0). Then, the run method is invoked. The
master waits for the children processes, and the slaves wait a random time
before notifying the master of their termination. When the last slave process
has ended, the master process finishes its run method, returning the execution
control to the main routine.
When running this example as a job, we require an MPI job submission script.
As mentioned in Section 4.3, a parallel job script must request a parallel
environment to use. As we compiled our application using the OpenMPI
library, we must run it under this parallel environment. Therefore, the job
script looks like the one shown in Listing 17.
The submission of this script is described in Section 4.4, and the results are
recovered in the same way as in the previous example (through the output file
generated by the workload system).
References

[1] Bash Linux commands. http://ss64.com/bash/.

[2] Factory pattern. http://en.wikibooks.org/wiki/C++_Programming/Code/Design_Patterns.

[3] GNU Make: an introduction to makefiles. http://www.apl.jhu.edu/Misc/Unix-info/make/make_2.html.