
MAJOR SCIENTIFIC AND TECHNOLOGICAL

EQUIPMENT FOR USER FACILITY CENTERS


Programa de Investigación Asociativa (PIA) - CONICYT
National Laboratory for High Performance Computing (NLHPC) - ECM02

Levque Cluster User Manual


Juan-Carlos Maureira, Claudio Baeza R., Tomas Perez
August 28, 2011

1 Introduction
1.1 Context
The National Laboratory for High Performance Computing (NLHPC) project aims to install in Chile a supercomputing infrastructure to meet the domestic scientific demand for high performance computing (HPC), offering high quality services and promoting their use in basic and applied research as well as in industrial applications. In recent years the development of applied science and industry has been driven by the sophisticated use of information and communication technologies (ICT), a process in which HPC has played an important role. In Chile, some areas of science as well as some industrial sectors have reached a level of maturity that requires HPC-related technologies to maintain their global competitiveness. Recognizing the opportunities that the availability of this technology will bring to the country, most of the research universities of Chile, led by the Center for Mathematical Modeling (CMM) of the University of Chile (UChile), proposed to CONICYT the creation of the National Laboratory for High Performance Computing (NLHPC). The NLHPC was created with the University of Chile as Sponsoring Institution and, as Associated Institutions (AI), Pontificia Universidad Católica de Chile (PUC), Universidad Técnica Federico Santa María (UTFSM), Universidad de Santiago (USACH), Universidad de la Frontera (UFRO), Universidad de Talca (UTalca) and Universidad Católica del Norte (UCN), in association with REUNA. The NLHPC supercomputing infrastructure will be composed of several HPC clusters distributed among the members of this laboratory. All these clusters will be connected via high speed networks provided by REUNA. The central processing node is hosted at the Center for Mathematical Modeling (CMM), Faculty of Physical Sciences and Mathematics (FCFM) of UChile, a center of excellence in scientific research with extensive experience in managing large collaborative projects. The present manual describes the current HPC infrastructure of the CMM, called "Levque". This name comes from the word for thunder in Mapudungun, the language of the Mapuche, the main native people of Chile.

1.2 The Levque Cluster, the first cluster at the NLHPC


The Levque cluster is an IBM iDataplex machine with 536 compute cores dedicated to satisfying the demand for scientific computing at the Center for Mathematical Modeling (CMM), University of Chile. This machine, funded by the BASAL project PFB-03, offers CMM's researchers a computing power of about 6 TFlops. This computational power is achieved thanks to the combination of Intel Nehalem processors, a QLogic Director Infiniband switch and an appropriate software and hardware integration. This document is devoted to explaining how to use this machine; however, to help users understand its use, it is recommended to review its architecture and some internals about how it operates.
In the following, the hardware architecture is presented in Section 2. Section 3 describes the software stack available to users and Section 4 covers the basic aspects of using the Levque Cluster. Finally, Section 5 shows two examples of the construction of scientific applications on Levque.

2 Hardware
2.1 Cluster Architecture
The cluster architecture can be divided into three main areas: the Computing Area, the Storage Area and the Administration Area. These areas are interconnected by means of two separate networks, each one playing a different role within the HPC infrastructure. The computing area is where the scientific computations are performed; it is composed of several compute nodes, each one equipped with multicore CPUs, on which users run their applications. The storage area provides a scalable persistence layer for the data required by the computing area. The administration area is used to facilitate the interaction with the computing area and to monitor the whole infrastructure as well as the correctness of the users' jobs. These areas are bonded together through the interconnection network and the administration network. The former is used for computing purposes (I/O and IPC, for example) and the latter is used to operate, maintain and monitor the HPC infrastructure.
In particular, the Levque Cluster architecture comprises a computing area of 67 nodes, representing 536 cores exclusively dedicated to running users' jobs. The storage area is composed of 5 nodes: four dedicated only to I/O operations, providing a total of 8 TB of available space, managed by one server known as the meta-data server. Finally, the administration area is composed of four nodes: two acting as the head of the cluster (two master nodes in a fail-over configuration) and two acting as the interface of the cluster (one for users and one for grid computing). The interconnection network is a packet switched network based on the Infiniband (IB) technology, capable of reaching a throughput of 40 Gb/s per port with very low end-to-end latency. Each node in the cluster is equipped with an IB Host Card Adapter (HCA) with two ports, both connected to a switch capable of growing up to 432 Infiniband ports (by adding leaf modules). The administration network is composed of 5 Ethernet switches providing a link rate of up to 4 Gb/s (4 x 1 Gb/s), from which users can log into the cluster to run their jobs and recover their results. Figure 1a depicts how the interconnection network bonds the above defined areas and Figure 1b depicts the administration network layout.

Figure 1: (a) Interconnection Network, (b) Ethernet Network

In the following, we introduce each area in detail, covering more technical aspects of the architecture described above.

2.2 Computing Area


The computing area is composed of 66 compute nodes serving the users' requirements and one compute node for testing purposes. The first 66 compute nodes are equipped with two quad-core Intel Xeon X5550 processors running at 2.67 GHz (each compute node has 8 cores) and 24 GB of RAM each. They are monitored and administrated through the Ethernet network and interconnected with the other cluster areas through the Infiniband network. We highlight that each compute node has a dual port IB HCA with both ports connected. This means that each node can establish a high speed communication (40 Gb/s) with two compute nodes at the same time. The matlab compute node is equipped with two quad-core Intel Xeon E5520 processors running at 2.27 GHz and 24 GB of RAM; this node is connected to the Ethernet network only.

Figure: IBM dx360 M2 Compute Node

In summary, the Levque cluster offers to users a theoretical compute power of 6 TFlops and more than 1.5 TB of distributed RAM.

2.3 Storage Area


The Levque cluster implements the Lustre Parallel Filesystem [4], which is well known in the HPC community as one of the best open source solutions to provide a storage backend for HPC clusters. In particular, the Lustre architecture implemented in Levque is composed of four I/O servers, one meta-data server and one backup server. The I/O servers are accessed by the compute nodes through the Infiniband network. This interconnection network ensures a high throughput when applications are performing input/output file system operations, since it is capable of performing load balancing and striping of data among the I/O servers. Each one of these servers is equipped with 5 hard disks of 500 GB, providing an aggregated storage space of 8 TB with an n+1 redundancy of data. The addressing and allocation of space within this storage area is managed by the Lustre meta-data server, which knows the location and redundancy of each object stored on the pool of disks (Lustre is an object based parallel file system).

Figure 2: IBM x3650 M3 Storage Server

2.4 Administration Area
The administration area is composed of two parts: the administration nodes and the administration network. The first part consists of two master nodes, one active and one passive in a fail-over configuration, and a frontend node, from which users can interact with the cluster. These nodes are equipped with two quad-core Intel Xeon E5540 processors at 2.67 GHz and 28 GB of RAM. The frontend node is reachable through the hostname development.dim.uchile.cl. The second part is composed of 5 Gigabit Ethernet switches with 48 ports each. They are interconnected in such a way as to provide redundant connectivity with the computing and storage areas. The monitoring tasks are also performed over this network and are aggregated by the master node of the cluster. It is important to remark that the master node also monitors the power supply facility, which provides up to 60 kVA of electrical power to keep the whole HPC infrastructure up and running.

2.5 Networking
The Levque cluster uses two interconnection networks: a high speed network called Infiniband and an Ethernet network. The former is used to communicate between the compute nodes for calculation and I/O purposes, and the latter is used for administration and user interaction.
The Infiniband network used by Levque is built around a QDR, 100% non-blocking fabric switch providing a bandwidth of 40 Gb/s per port and a latency of approximately 100 ns. Through this network, the compute nodes are allowed to use a message passing communication library (MPI) as well as IP (layer 3) services. In addition, the Levque IB network provides the Remote Direct Memory Access (RDMA) method, which gives users the ability to communicate different processes in a memory-to-memory access scheme (almost similar to a shared memory system).
The Ethernet network is a 1 Gb/s network and exhibits latencies of about 1 ms at layer 3, mostly generated by the OS network stack. This network is used to interact with the cluster through a frontend (login) node, which is accessed via the secure shell (ssh) protocol. This interaction includes command shell and file operations at the moment (no visualization services are provided).

Figure 3: QLogic 12800-180 Infiniband Switch

3 Software
The software loaded into the Levque cluster can be divided into two categories: 1) administrative software and 2) scientific software. The administrative software is divided into three areas: a) base software, b) development tools and c) libraries. The scientific software is divided into two areas: d) licensed software and e) open-source software. The licensed software has a restricted use due to the license agreements; for further details on how to use your own licensed software, please contact the NLHPC personnel. The opensource software available at Levque can be used by anyone.
In the following, each area is listed with the version of the software and the respective environment where it is available. The environment settings are detailed in Section 4.6. It is worth mentioning that, for space reasons, only the most important packages are listed.

3.1 Administration Software


• Base Software

Operating system: Linux CentOS 5.5
Glibc: 2.5
Compat glibc: 2.3.4
Linux kernel version: 2.6.18-194
Workload system: Sun Grid Engine 6.2u5
Monitoring: Ganglia 3.1.7 (http://monitoring.dim.uchile.cl/ganglia)
File System: Lustre 1.8.5

• Developing Tools

Name Version Modulefile or HomePath
Gnu compilers 4.1.2 /usr/bin
Gnu compilers 4.4.0 /usr/bin
Intel compilers 11.1.072 /opt/intel/Compiler/11.1/072
PGI compilers 10.9 /opt/pgi/linux86-64/10.9
Openmpi with gnu 4.1.2 1.4.2 openmpi/1.4.2
Openmpi with gnu 4.1.2 1.4.3 openmpi/1.4.3
Openmpi with gnu 4.4.0 1.4.3 openmpi_gcc44/1.4.3
Openmpi with intel 11.1 1.4.3 openmpi_intel/1.4.3
Openmpi with pgi 10.9 1.4.3 openmpi_pgi/1.4.3
Valgrind 3.5.0 /usr/bin
Boost 1.33.1 /usr/bin
Java 1.6.0 /usr/bin
Python 2.4.3 /usr/bin
Python 2.6.6 python/2.6.6
Perl 5.8.8 /usr/bin

• Libraries

Name Version Modulefile or HomePath
Atlas 3.6.0 /opt/atlas-3.6.0
FFTW2 2.1.5 (Intel version) /opt/fftw-2.1.5_intel
FFTW3 3.2.2 (single precision) /opt/fftw-3.2.2-single
FFTW3 3.2.2 (double precision) /opt/fftw-3.2.2-double
NetCDF 4.1.1 /opt/netcdf-4.1.1
HDF5 1.8.6 /opt/hdf5-1.8.6
GRIB 1.9.9 /opt/grib-1.9.9
GSL 1.14 /opt/gsl-1.14
MKL 11.1.072 /opt/intel/Compiler/11.1/072/mkl
3.2 Scientific Software


• Licensed (restricted use due to license agreements)

Name Version Modulefile or HomePath
Cplex 12.2.0 cplex/12.2.0
Cplex 12.1.0 cplex/12.1.0
Cplex 9.1.3 cplex/9.1.3
Stata 11.0 stata/11.0
Matlab 7 /opt/matlab7
Gaussian 09B01 gaussian/09B01
Fluent 12.0 fluent/12.0
Fluent 12.1 fluent/12.1
Comsol multiphysics 4 comsol/4.0
Knitro 6.0 /opt/knitro/knitro-6.0.0-student
Knitro 7.0 /opt/knitro/knitro-7.0.0-z

• OpenSource

Name Version Modulefile or HomePath
Octave 3.0.5 octave/3.0.5
Namd 2.7 namd/2.7
Gromacs 4.5.3 (single precision) gromacs_single/4.5.3
Gromacs 4.5.3 (double precision) gromacs_double/4.5.3
FDS 5.5.3 /opt/FDS
QTsdk 2010.04 /opt/qtsdk-2010.04
Hmmer3 3.0 /opt/hmmer3
ABYSS 2.0 /opt/abyss-2.0
Mpiblast 1.6.0 mpiblast/1.6.0
WRF 3.2 wrf/3.2
Polyphemus 1.8.1 polyphemus/1.8.1

4 Interaction with the Levque Cluster


The interaction of the user with the software mentioned in the previous section is described in Figure 4. The user must access the Levque Cluster by means of a login operation, described in Section 4.1, and then prepare a submission script (Section 4.3), where the required resources are defined and the execution script is provided. Once the user has validated the script for correctness, it must be submitted to the workload system (Section 4.4), which will queue the job to be executed in the computing area. The user may then monitor the job state in order to verify its execution and completion (Section 4.5). When the script finishes its execution, the results can be obtained from the storage area.

4.1 Accessing the Cluster


One important prerequisite to interact with the Levque Cluster is a good knowledge of *NIX operating systems. In particular, experience in Linux is desirable since the interaction with the machine is console (text) based, not relying
on any graphical interface. At the moment (March 2011) there is only one way to access the Levque Cluster, although additional access methods are planned for the near future: the access is made via the frontend node development.dim.uchile.cl, and to access this node it is mandatory to have an account on the Levque Cluster. The login operation is made through the SSH protocol; therefore, it is necessary to have a client application supporting this protocol. There are several such applications for different operating systems. For Linux, the ssh client is part of the OpenSSH package. For Windows, an SSH client can be downloaded from http://www.putty.org/. For Mac OS, the page http://openssh.com/macos.html describes several alternatives for obtaining an SSH client.
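For example, assuming a (hypothetical) account named jdoe, the login from a Linux or Mac OS terminal would look like the following; Windows users get an equivalent prompt from PuTTY:

[jdoe@mylaptop ~]$ ssh jdoe@development.dim.uchile.cl
jdoe@development.dim.uchile.cl's password:
[jdoe@development ~]$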
Once logged onto the Levque cluster (the frontend node), the user interacts with the operating system as in any *NIX system. The user prepares the execution of his or her scientific application by means of a script which is submitted to the workload system, which in turn schedules the execution as a job. It is important to highlight that running scientific applications for long times on the frontend node is completely forbidden. We understand as a long time anything over 1 minute. For testing purposes, users may run their applications for short times to ensure their correct execution.

Figure 4: User interaction diagram

However, for validation of results (which normally requires longer execution times), the execution must be performed in the compute area.

4.2 Running Applications


The execution of applications is performed through the workload system by means of a job file (or script, to be precise). A job is defined as the execution of a script subject to a set of resources which are requested by the user and administrated by the workload system. These jobs are executed in the compute area, which cannot be accessed directly by the user. All the interaction with the compute area is made through the workload system (to submit jobs) and through the storage area (where the input and output data are stored). Each job has a Job ID (also referred to as PID in this manual), which is unique and is used to perform operations over the job. The workload system generates one or two output files per job, depending on whether the stdout and the stderr are consolidated in a single file (detailed later on). Each output file is named after the job, with an extension composed of the letter "o" (stdout) or "e" (stderr) plus the Job ID assigned by the workload system. For instance, for a job called "test" with Job ID 33530, the output files will be "test.o33530" and "test.e33530", assuming stdout and stderr are not joined.
In general, the steps to execute an application in the Levque Cluster are the
following:

1. Build a job script.


2. Submit the job script to the workload system.
3. Monitor and/or manage the job through the monitoring system while it is executed.
4. Once the job is finished, recover the results from the storage area.
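As an illustration of this cycle (the script name myjob.sh and the job ID shown are hypothetical), a typical session might look like:

[user@development ~]$ qsub myjob.sh        # step 2: submit the script
Your job 12345 ("myjob") has been submitted
[user@development ~]$ qstat                # step 3: monitor the job state
[user@development ~]$ ls myjob.o12345      # step 4: once finished, inspect the output
myjob.o12345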

All the technical aspects related to job submission, scheduling and execution on the Levque Cluster are controlled by the Oracle Grid Engine (OGE) software [6] (formerly known as Sun Grid Engine or SGE). The OGE limits the total number of jobs/slots each user may run simultaneously on the cluster. However, the user may submit as many jobs as needed, with the constraint that they will be queued waiting for available resources before being executed. The policies to manage the sharing of resources among users are defined by the Scientific Strategic Committee (SSC) of the NLHPC and implemented in the OGE.
Notice that the HPC resources (number of cores, wall/user time, storage space, RAM, etc.) are assigned to users following a fair use policy. In the near future this policy will be replaced by open calls where users can request HPC resources. Through these calls, twice a year, users requesting HPC resources will submit their research proposals to the NLHPC. An external committee will review the merit of the proposals, ranking them to be presented to the SSC. Further details on this Call for Proposals will be available on the NLHPC website. Also, the SSC and/or the NLHPC Executive Committee may assign different priorities to users according to the NLHPC development strategy plan. Therefore, the workload system will not be a simple FIFO (First In, First Out) queue; a formula in the scheduler will determine each job's position in the queue.

4.3 Building a Job Script


A job script is a plain text file containing instructions for the workload system that specify how to run an application in the compute area. In the directory /home/shared/sge there are many example scripts that can be used as templates to prepare new job scripts.
1  #!/bin/csh
2  #$ -cwd
3  #$ -j n
4  #$ -notify
5  #$ -M claudio.baeza.r@gmail.com
6  #$ -m abes
7  #$ -N test
8  #$ -S /bin/csh
9  #$ -q all.q
10 source /etc/profile.d/modules.csh
11 module load pgi
12 ./myprogram

The example above illustrates a basic job script. All the lines beginning with #$ are directives interpreted by the OGE workload system, and each line specifies a different property or resource for running the application:

• Line 1: defines the command shell interpreter that is used to execute the
script when running it outside the OGE environment.
• Line 2: use the current directory as the working directory.
• Line 3: "y" to join the output and error messages into a single file, or "n" to keep them in separate files.
• Line 4: Notification before the end of the job. The OGE will send a SIGUSR2 signal to your application 60 seconds before the run time limit specified by the -l h_rt option, to warn it before killing it, so that it can perform some cleanup and save results before losing everything. Your application must intercept this signal, otherwise the -notify option has no effect (see the sketch after this list).
• Line 5: Defines an email where any notification about the job state will
be sent.
• Line 6: This option is used together with the previous option and defines
the event by which the email will be sent.
– b: at the beginning of the job.
– e: at the end of the job.
– a: when the job is aborted or rescheduled.
– s: when the job is suspended.
• Line 7: This option defines the name of the job. It is used when displaying
the list of running jobs.
• Line 8: Specifies the command interpreter (shell) that will be used to
execute the job inside the OGE environment.
• Line 9: Indicates the queue in which the job will be executed.

• Line 10: Here the execution script begins. The language used must agree with the shell interpreter specified at line 8.
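As a minimal sketch of the -notify mechanism (the one-hour run-time limit, the file names and the program name below are placeholders, not part of the Levque configuration), a bash job script could request a run-time limit with -l h_rt and trap SIGUSR2 as follows:

#!/bin/bash
#$ -cwd
#$ -S /bin/bash
#$ -notify
#$ -l h_rt=01:00:00

# save partial results when OGE warns us, 60 seconds before the kill
trap 'echo "SIGUSR2 received, saving partial results"; cp partial.dat "$HOME"/results/' USR2

# run the program in the background and wait, so the trap can fire immediately
./myprogram &
wait $!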

There are many ways to configure a job script file, but there are three classic job types implemented by the OGE: the standard job, the parallel job and the job array. The first one is meant to execute a single task on a single CPU (with the memory limitations of a single node). The parallel job and the job array define more complex scenarios.

4.3.1 Parallel Job


A parallel job is used to execute a parallel application which runs across several cores (and nodes) as a single application. For this purpose, a parallel environment must be defined in order to provide the communication and synchronization functions required by the parallel application. These environments are provided by a framework, implemented as a library, against which the parallel application is compiled. The most popular parallel framework is the MPI [5] (Message Passing Interface) library. It defines all the methods and data types required to implement a parallel algorithm in an application. There are several implementations of MPI, from which several parallel environments are defined. The Levque Cluster provides the following parallel environments:

1. openmpi: to run applications based on the openmpi flavor of MPI.
2. openmp: to run applications based on shared memory using OpenMP directives.
3. mpich2: to run applications based on the mpich2 flavor of MPI.
4. gaussianET: to run gaussian09 using Linda over Ethernet.
5. gaussianIB: to run gaussian09 using Linda over Infiniband.
6. fluentP: to run Ansys Fluent 12.1 using openmpi.

To run a parallel application, the user must select the appropriate parallel environment from the previous list. This parallel environment is declared in the job script, as depicted in the following example:
1  #!/bin/csh
2  #$ -cwd
3  #$ -j n
4  #$ -notify
5  #$ -M claudio.baeza.r@gmail.com
6  #$ -m abes
7  #$ -N test
8  #$ -S /bin/csh
9  #$ -pe openmpi 64
10 #$ -q all.q
11 source /etc/profile.d/modules.csh
12 module load openmpi_intel
13 mpirun myparallelprogram

Notice line 9, where the directive "-pe" requests the execution of "myparallelprogram" under the "openmpi" environment with 64 cores. The allocation policy of processes within an environment is defined by the OGE.
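Before submitting such a job, the application must be compiled against the same MPI flavor that will be loaded at run time. A minimal sketch (the source file name is a placeholder) using the Intel build of Open MPI would be:

[user@development ~]$ module load openmpi_intel/1.4.3
[user@development ~]$ mpicxx myparallelprogram.cpp -o myparallelprogram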

4.3.2 Job Arrays


A Job Array is a script capable of running multiple times, where each execution receives an environment variable that differentiates it from the others. This variable is unique and is named SGE_TASK_ID. Its value is determined by the directive -t start-end:step. This value can then be used to drive a parametric study by assigning a different parameter value to each execution. For this, the user must transform the SGE_TASK_ID into the parameter under study; lookup tables or simple shell computations can be used to achieve this mapping (a lookup sketch is shown below, and the job script that follows uses shell arithmetic).
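As a minimal sketch of the lookup-table approach (params.txt is a hypothetical file holding one parameter value per line), the N-th task simply reads the N-th line of the parameter file:

# inside a job array script: pick the parameter for this task
PARAM=`sed -n "${SGE_TASK_ID}p" params.txt`
./myprogram ${PARAM}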
1  #!/bin/bash
2  #$ -cwd
3  #$ -j n
4  #$ -notify
5  #$ -M claudio.baeza.r@gmail.com
6  #$ -m abes
7  #$ -N test
8  #$ -S /bin/bash
9  #$ -t 40-80:2
10 #$ -q all.q
11 source /etc/profile.d/modules.csh
12 module load pgi
13 INPUT=`echo "${SGE_TASK_ID} * 1.68" | bc`
14 OUTPUT=${SGE_TASK_ID}-out.dat
15 myprogram ${INPUT} ${OUTPUT}

In the example shown above, the application "myprogram" is executed 21 times. These executions might run in parallel or sequentially, according to the availability of resources. Notice line 9: the directive -t assigns to the SGE_TASK_ID variable a sequence starting at 40 and ending at 80 with steps of 2 (40, 42, 44, 46, ..., 78, 80). In this way, each execution, better defined as a task of the job array, receives a different value from the sequence in its SGE_TASK_ID variable. So, in the example, "myprogram" receives as arguments:
myprogram 67.2 40-out.dat
myprogram 70.56 42-out.dat
...
myprogram 131.04 78-out.dat
myprogram 134.4 80-out.dat

Under the /home/shared/sge directory you can find several examples of job array scripts.

4.4 Submitting Jobs


To submit a job to the workload system (OGE), the user must use the qsub
command. For example, for a job script called test.sh, the submission command
looks like:
[claudio@development ~]$ qsub test.sh
Your job 33530 ("test") has been submitted
[claudio@development ~]$

Notice the command returns the Job ID assigned to the execution of the
test.sh script (in this case the PID 33530). For further information, use the
command man qsub.

4.5 Monitoring a Submitted Job


Each submitted job has a state in the workload system. The command to obtain this state is called qstat. In the following example, there is a job called "test", submitted by the user "claudio" on March 21, 2011, requesting only 1 CPU (slot) and identified by the Job ID 33530. Notice that the state of this job is queue-waiting.
[claudio@development ~]$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
---------------------------------------------------------------------------------------
33530 0.50500 test claudio qw 03/21/2011 22:18:15 1

The most common job states are: qw (queue-waiting), r (running), Eqw (error-queue-waiting) and s (suspended). However, there are many other states, which can be found in the documentation of the qstat command (man qstat). There are several options (flags) of the qstat command to monitor the queue state. For instance, to see how many cores (slots) are available on each available queue, the command to issue is qstat -g c.
[claudio@development ~]$ qstat -g c
CLUSTER QUEUE CQLOAD USED RES AVAIL TOTAL aoACDS cdsuE
---------------------------------------------------------------------------------------------
all.q 1.00 508 0 12 528 0 8
fcab.q 0.01 0 0 2 2 0 0
matlab.q 0.01 0 0 6 6 0 0
[claudio@development ~]$

The example shown above indicates the load on each queue (CQLOAD), the cores in use (USED), the cores in reserved state (RES), the cores available in the queue (AVAIL) and the total number of cores (TOTAL). The two following columns give the number of jobs that are in the states:

• a - Load threshold alarm
• o - Orphaned
• A - Suspend threshold alarm
• C - Suspended by calendar

• D - Disabled by calendar

or in the following states:

• c Configuration ambiguous

• d Disabled
• s Suspended
• u Unknown

• E Error

Other useful commands to track the state of the queue and the jobs are:

• Cause of the error in a job in the E state: qstat -j jobID
• Information on the sub-jobs (tasks) in a job array: qstat -t or qstat -g t
• Status per sub-job in a job array: qstat -g d
• How many slots (cores) are busy by you: qquota
• Compute node states in the cluster: qhost
• Jobs by host: qhost -j

In addition, to manage jobs in the workload system there are several commands, from which we mention only the most important ones according to our criteria:

• Deleting a job: qdel jobID
• Deleting all your jobs: qdel -u username
• Holding a job in qw state in the queue: qhold jobID
• Releasing a held job in the queue: qrls jobID
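For instance, assuming the job with ID 33530 from the previous examples is still waiting in the queue (the session below is only a sketch), it could be held, released and finally deleted as follows:

[claudio@development ~]$ qhold 33530
[claudio@development ~]$ qstat        # the state changes from qw to hqw
[claudio@development ~]$ qrls 33530
[claudio@development ~]$ qdel 33530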

Further information on each one of the commands mentioned in this section


can be found under the OGE User Manual (see references at the end of this
document).

4.6 Working Environments: the module utility
The module utility is a user interface that provides dynamic modification of the user's environment via the command line, for instance when handling several versions of the same application or library. Each modulefile contains the information needed to configure the environment for an application. Once the module package is initialized, the environment can be modified on a per-module basis using the module command, which interprets modulefiles. Typically, modulefiles instruct the module command to alter or set shell environment variables such as PATH, MANPATH, etc. For example, to see which modules are available, the command to issue is module avail; to list the currently loaded modules, issue module list.
[claudio@development ~]$ module list
Currently Loaded Modulefiles:
1) openmpi/1.4.3
[claudio@development ~]$ module avail

----------------------- /usr/share/Modules/modulefiles ----------------------------


dot module-cvs module-info modules null use.own

----------------------- /etc/modulefiles ------------------------------------------


cplex/12.1.0 gromacs_single/4.5.3 openmpi/1.4.3 openmpi_pgi/1.4.3
python/2.6.6 cplex/12.2.0 namd/2.7 openmpi_gcc44/1.4.3
opl/6.3 stata/11.0 cplex/9.1.0 openmpi/1.4.2
openmpi_intel/1.4.3 pgi/10.9
[claudio@development ~]$

To load a specific environment, the command to issue is module load modulefile, where modulefile is the name of the module (environment, to be precise) to load; to unload it, the command module unload modulefile is used analogously.
[claudio@development ~]$ module load openmpi_intel/1.4.3
[claudio@development ~]$ module list
Currently Loaded Modulefiles:
1) openmpi_intel/1.4.3
[claudio@development ~]$ which mpicc
/opt/openmpi_intel/1.4.3/bin/mpicc
[claudio@development ~]$
[claudio@development ~]$ module list
Currently Loaded Modulefiles:
1) openmpi_gcc44/1.4.3 2) cplex/9.1.0
[claudio@development ~]$ module unload cplex
[claudio@development ~]$ module list
Currently Loaded Modulefiles:
1) openmpi_gcc44/1.4.3
[claudio@development ~]$

5 Building an Example Application


In this section we introduce two examples to illustrate the methodology of interaction with the Levque Cluster: an application run in a distributed way and an application run in a parallel way (MPI). The first one covers the interaction from the application design up to the generation of results, and the second one illustrates only how to build and execute an MPI application based on a classic parallel problem. Both examples are written in C++ and require some background in object oriented programming, some experience in compiling code (Makefiles [3], for example) and bash scripting [1]. At the end of this document, several documents are referenced as recommended readings to improve the user's understanding of these examples.

5.1 The bisection method study: an example of a distributed application
The objective of this example is to perform a parametric analysis of the bisection method for the root finding problem in a polynomial function f(x). We focus on the effect of the tolerance of the method, ε, on the quality of the solution. We understand as quality how close the numerical solution is to the theoretical one. For the sake of simplicity, we define as target function f(x) = x^2 − 1, which has a known root at x = 1. The methodology is to evaluate the bisection method within a fixed interval [a_0, b_0] for several ε values. Thus, we determine the value of ε for which the method exhibits the best convergence (in quality) to the theoretical solution. For that, we plot the solutions provided by the method for all the evaluated ε. In this way we evidence the contrast between the numerical and the theoretical solution.
The first step is to implement the bisection method in C++ and to create an application (a binary file) containing this implementation. This application receives as arguments the interval to explore and the value of ε. The second step is to create a job array script to execute this application in a distributed way, with each execution using a different value of ε. In this script, we use the variable SGE_TASK_ID to compute the ε to be evaluated. The third step is to parse the output of each execution in order to obtain the values of x where f(x) = 0 for each ε. We define the interval to explore as [0, 100], and we explore 500 values with ε = 1 / (100 · SGE_TASK_ID); for instance, SGE_TASK_ID = 1 gives ε = 0.01 and SGE_TASK_ID = 500 gives ε = 0.00002. Once the 500 runs are executed, the last step is to construct a post-processing script to plot the results.
As mentioned before, we implement the bisection method in C++. For that, we need an object oriented design for the final application. We follow a top-down approach, designing first the application at a high level. Then, we identify the objects defined at this level and define the associated classes. Next, we implement these classes and compile the whole application by using a makefile.
The file shown in Listing 1 implements the main function of the C++ application. This file is called main.cpp. The code is well documented between instruction lines; nevertheless, some important considerations are discussed after the code.

Listing 1: The bisection method main application code (main.cpp)


/*
 * Basic C++ Application
 * Find the root of f(x) by the bisection method.
 * f(x) is hardcoded to make the example easier.
 * The parameters are a_0, b_0 and epsilon (tolerance)
 *
 * Juan-Carlos Maureira
 * NLHPC - CMM
 * Universidad de Chile
 * March, 2011
 */

#include <iostream>
#include <cstdlib>   // for atof
#include <time.h>
#include "BisectionMethod.h"

using namespace std;

int main(int argc, const char* argv[]) {

   // check the arguments provided by the command line
   if (argc != 4) {
      cout << "usage: " << argv[0] << " a_0 b_0 epsilon" << endl;
      return 9;  // we associate 9 to the invalid arguments error
   }

   double a_0     = atof(argv[1]);
   double b_0     = atof(argv[2]);
   double epsilon = atof(argv[3]);

   if (epsilon <= 0) {
      cout << "epsilon must be greater than 0" << endl;
      return 9;
   }

   cout << "find the root of f(x) = x^2 - 1 by the bisection method with epsilon = " << epsilon << endl;

   // strategy:
   // 1. we instantiate the solver class (BisectionMethod). If the
   //    initial interval is invalid, we catch the corresponding exception
   //    to inform the error.
   // 2. we start the iteration method. If it does not converge after
   //    MAX_ITERATIONS (defined by the solver class), we catch the DivergenceException.
   // 3. If we do not get any exception, the method has found a solution.
   //    Any other exception is captured to know the process has terminated abnormally.
   // 4. we query the solver: if we got a solution, we show it, otherwise we
   //    inform that there is no solution.

   // we measure the execution time
   clock_t start, end;
   start = clock();

   try {
      // instantiate the bisection method
      BisectionMethod bm(a_0, b_0, epsilon);

      // start the computation
      cout << "computing..." << endl;
      bm.start();

      // get the ending clock
      end = clock();

      // print the execution time
      double exec_time = (double)(end - start) / CLOCKS_PER_SEC;
      cout << "Execution time: " << exec_time << " s" << endl;

      // check whether a root has been found or not.
      if (bm.existRoot()) {
         // show the results
         cout << "Root found: f(x) = 0 with x = " << bm.getRoot()
              << " in [" << a_0 << "," << b_0 << "]" << endl;
      } else {
         cout << "No solution found" << endl;
      }
   } catch (InvalidIntervalException& ex) {
      // this code is executed when the InvalidIntervalException has been thrown.
      // the what() method tells us the reason of the exception
      cout << "Invalid initial interval, exiting..." << endl;

      // we return 1, associating it to the invalid interval error.
      return 1;
   } catch (DivergenceException& ex) {
      // this code is executed when the DivergenceException has been thrown.
      // the what() method tells us the reason of the exception
      cout << "the method does not converge" << endl;

      // we return 2, associating it to the divergence problem.
      return 2;
   } catch (exception& ex) {
      // this code is executed when any other exception has been thrown.
      // the what() method tells us the reason of the exception
      cout << "ERROR: we caught an exception. reason: " << ex.what() << endl;

      // we return 3, associating it to any other error reason
      return 3;
   }

   // return 0 to inform the interpreter that
   // the application has ended normally.
   // any value bigger than 0 is considered an abnormal termination

   return 0;
}

// Note that we do not need to release the memory used by the bm object,
// since it was not created by a new command; therefore, the existence of
// the object is bounded to the scope of the function (method) where it was
// instantiated. In other words, when the main routine finishes, the
// memory will be automatically released. Attention: it is not the same
// when the object is instantiated by means of a new command.
//
// Enjoy!
// JcM

The first consideration about this code is that we use exceptions to catch the problems that the method may encounter when running. In object oriented programming, an exception is defined as any condition that may produce an error in the execution of the code. These exceptions can be "caught" and handled by the user (who is programming the application) in order to avoid the error in the execution. In this way, all the code inside the "try" clause is "catchable", and by polymorphism we identify the kind of exception that was triggered from the methods used within the "try" scope. In particular, in the bisection method we have two important problems that need to be caught: 1) the algorithm does not converge, and 2) the initial parameters are incorrect. Therefore, the implementation of the bisection method should evaluate these conditions and throw the appropriate exceptions, allowing the programmer to handle these situations. We define as a divergence exception the case when the algorithm iterates more than 1e9 times.
The second consideration is about the design of the BisectionMethod class. We define as its interface three methods: start(), existRoot() and getRoot(). The first one begins the iteration process with the parameters given at the construction of the object (the constructor receives the parameters of the run). The second method returns whether a solution has been found or not, and the third method returns the solution found. For simplicity, we do not handle the case where the programmer invokes this method and no solution has been found; this case can easily be handled by defining a new exception.
In summary, we identified two main kinds of objects to be implemented: the BisectionMethod and the exceptions. The BisectionMethod acts as a solver and the exceptions help the user to handle the possible problems that may occur with the solver. In Listing 2, the interface of the class BisectionMethod is defined.
Listing 2: Interface of the bisection method class (BisectionMethod.h)
#ifndef BISECTIONMETHOD
#define BISECTIONMETHOD

/*
 * Interface of the BisectionMethod class
 *
 * Juan-Carlos Maureira
 * NLHPC - CMM
 * Universidad de Chile
 * March, 2011
 */

#include <math.h>
#include "BisectionMethodExceptions.h"

#define MAX_ITERATIONS 1e9

class BisectionMethod {
   private:
      double epsilon;
      double a_0;
      double b_0;

      double root;
      bool root_found;

   public:
      // constructor
      BisectionMethod(double a_0, double b_0, double epsilon);

      // No need of a destructor (no objects are created within an
      // instance of this class)

      /* start the method */
      void start();

      /* return if a root has been found or not */
      bool existRoot();

      /* get the root when it exists */
      double getRoot();
};

#endif

With respect to the implementation of the BisectionMethod class, we recall that the function f(x) is hardcoded for simplicity reasons. However, by using a mathematical expression parser, this class could be improved to support any function without recompiling the code. For the implementation of this class, we need to keep in mind the exceptions to be used in order to handle the possible errors that the application might encounter at execution time. We define the DivergenceException and InvalidIntervalException classes for that purpose; their implementation is discussed later. In the following, we introduce the implementation of the BisectionMethod class.
Listing 3: Implementation of the bisection method class (BisectionMethod.cc)
1 #include " B i s e c t i o n M e t h o d . h "
2
3 /∗
4 ∗ Implementation o f the BisectionMethod c l a s s
5 ∗
6 ∗ f ( x ) = x ˆ2 − 1 , h a r d c o d e d t o s i m p l i f y t h e example
7 ∗
8 ∗
9 ∗ Juan−C a r l o s M a u r e i r a
10 ∗ NLHPC − CMM
11 ∗ U n i v e r s i d a d de C h i l e
12 ∗ March , 2011
13 ∗/
14
15 /∗ f ( x ) d e f i n i t i o n ( h a r d c o d e d ) ∗/
16 double f ( double x ) {
17 return ( x∗x − 1 ) ;
18 }
19
20
21 BisectionMethod : : B i s e c t i o n M e t h o d ( double a 0 , double b 0 , double
epsilon ) {
22 // i n i t i a l i z e the p r i v a t e variables
23 t h i s −>a 0 = a 0;
24 t h i s −>b 0 = b 0;
25 t h i s −>e p s i l o n = epsilon ;
26
27 t h i s −>r o o t f o u n d = f a l s e ;
28
29 // c h e c k when f ( a 0 ) ∗ f ( b 0 ) < 0 o r i n v a l i d epsilon
30 i f ( f ( a 0 ) ∗ f ( b 0 ) > 0 | | e p s i l o n <=0 ) {
31 // i n v a l i d i n t e r v a l
32 throw I n v a l i d I n t e r v a l E x c e p t i o n ( ) ;
33 }
34 }
35
36 void B i s e c t i o n M e t h o d : : s t a r t ( ) {
37 // i m p l e m e n t a t i o n o f t h e b i s e c t i o n method .
38
39 long i n t k = 0 ;
40
41 double x k , a k , b k ;
42
43 // i n i t i a l conditions
44 a k = a 0;
45 b k = b 0;

21
46
47 while ( k < MAX ITERATIONS) {
48
49 x k = (a k + b k) / 2.0;
50 i f ( f a b s ( f ( x k ) ) < t h i s −>e p s i l o n ) {
51 // s o l u t i o n f o u n d
52 t h i s −>r o o t = x k ;
53 t h i s −>r o o t f o u n d = true ;
54 return ;
55 }
56 i f ( f ( x k ) ∗ f ( b k ) < 0) {
57 a k = x k;
58 } else {
59 b k = x k;
60 }
61 }
62
63 // MAX ITERATIONS r e a c h e d . t h e method do n o t c o n v e r g e
64 throw D i v e r g e n c e E x c e p t i o n ( ) ;
65 }
66
67 /∗ r e t u r n i f a r o o t has been f o u n d o r n o t ∗/
68 bool B i s e c t i o n M e t h o d : : e x i s t R o o t ( ) {
69 return t h i s −>r o o t f o u n d ;
70 }
71
72 /∗ g e t t h e r o o t when i t e x i s t s , o t h e r w i s e i t throw an e x c e p t i o n ∗/
73 double B i s e c t i o n M e t h o d : : g e t R o o t ( ) {
74 return t h i s −>r o o t ;
75 }

The exception classes defined for this application inherit from the C++ exception class. This design decision is considered a best practice when using exceptions in C++; however, the user may implement his or her own exception model. Normally, the what() method should return the reason for the exception, or any important value required to handle it in the "catch" scope. Notice that we use polymorphism to identify the type of exception: the "catch" instruction receives a reference to an exception object, and the "catch" argument discriminates the object by its actual realization (instantiation). Listing 4 presents the implementation of these classes. We join the interface with the implementation without loss of generality of the code.
Listing 4: Implementation of the exception classes (BisectionMethodExceptions.h)
#ifndef BISECTIONMETHODEXCEPTIONS
#define BISECTIONMETHODEXCEPTIONS

/*
 * Bisection Method Exceptions set
 * Inherited from the C++ exception class
 *
 * More info about C++ exceptions:
 * http://www.cplusplus.com/doc/tutorial/exceptions/
 *
 * Juan-Carlos Maureira
 * NLHPC - CMM
 * Universidad de Chile
 * March, 2011
 */

#include <iostream>
#include <exception>

class InvalidIntervalException : public std::exception {
   public:
      virtual const char* what() const throw() {
         return "Invalid initial interval";
      }
};

class DivergenceException : public std::exception {
   public:
      virtual const char* what() const throw() {
         return "The method does not converge";
      }
};
#endif

Finally, we define the Makefile used to compile this application. In Listing 5 we propose a Makefile implementation for this purpose. It is worth mentioning that, as always, there are many ways of implementing this Makefile. The user is invited to read more about how to create these files and to implement them according to his or her own criteria.

Listing 5: Makefile to compile the bisection method application (Makefile)


CXX=g++
TARGET=bisection
SRC=main.cpp BisectionMethod.cc
OBJECTS=$(addsuffix .o, $(basename $(SRC)))

all: main

%.o: %.cc
	$(CXX) -c $<

.cpp.o:
	$(CXX) -c $<

main: $(OBJECTS)
	$(CXX) *.o -o $(TARGET)

clean:
	rm *.o
	rm $(TARGET)

Once these files are written on the Levque Cluster, the user only needs to issue the command make to start the compilation process. We highlight that a compilation process is internally quite complex and can fail for multiple reasons; the user must acquire some experience in compiling code in order to deal with compilation errors.
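For example (the tolerance value below is arbitrary and the compiler output is omitted), building the binary and testing it interactively with a short run could look like:

[user@development ~]$ make
[user@development ~]$ ./bisection 0 100 0.001
find the root of f(x) = x^2 - 1 by the bisection method with epsilon = 0.001
computing...
...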
After a successful compilation process, the generated target binary is called "bisection". As already mentioned, this binary application receives three arguments: a_0, b_0 (the interval to explore) and the tolerance ε. To run this application, we must create an execution script. This script must define the job properties as well as the instructions to prepare the required input parameters and to handle the output of the execution. For that, we create a job array execution script following Section 4.3.2. We define a job array from 1 to 500 with steps of 1 to compute the tolerance ε = 1 / (100 · SGE_TASK_ID). Then, we execute the application redirecting its output (stdout) to a temporary file, which is parsed afterwards to extract the solution found. An important remark when using job arrays and temporary files is that these files should not share the same name; otherwise, results will be overwritten by the sibling tasks within the array. Therefore, we use a randomly named temp file in order to avoid this problem. When using this trick, it is important to remove the temp file before exiting the script, to avoid polluting the computing area with temp files. In Listing 6 the execution job script for the bisection application is presented.

Listing 6: Job script for the bisection method application (bisection_job.sh)
#!/bin/bash
# SGE Job Array definition
#$ -cwd
#$ -j n
#$ -notify
#$ -M jcm@dim.uchile.cl
#$ -m abes
#$ -N bisection
#$ -S /bin/bash
#$ -q all.q
#$ -t 1-500:1

# bisection method study script
SOLVER=./bisection

# initial conditions
a_0=0
b_0=100

# compute the epsilon to evaluate
epsilon=`echo "scale=10; 1/(100*$SGE_TASK_ID)" | bc`

# create a temp file (in /tmp) to grab the solver output
OUTPUT=`mktemp`

# evaluate the epsilon
${SOLVER} $a_0 $b_0 $epsilon > $OUTPUT

# parse the output to get the computed root
# the root is in the last line, 9th token
X=`cat $OUTPUT | tail -n 1 | awk '{print $9}'`

# print the results
echo $epsilon $X

# please, remember to remove the temp output file
rm $OUTPUT

#EOF

The previous execution script writes the output of the application to stdout (see the cout instructions in main.cpp). Since the workload system redirects the output of the job to an output file, for this example the output files are called bisection.o plus the Job ID. As this job is a job array, each execution within the array is further identified by appending the task number to the extension of the output file. In this way, when submitting our job script, we should obtain 500 output files called "bisection.oPID.x", where PID is the Job ID and x ∈ [1, 500].
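As an illustration (the job ID shown is hypothetical), submitting the array and, once all tasks have finished, counting the generated output files might look like:

[claudio@development ~]$ qsub bisection_job.sh
Your job-array 34000.1-500:1 ("bisection") has been submitted
[claudio@development ~]$ ls bisection.o34000.* | wc -l
500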
As we parse the output of the binary application in order to print only the information we need, each output file will contain a single line with the ε value and the corresponding solution of f(x) = 0. Therefore, our post-processing script should gather all these output files, build a table ordered by ε value, and create a plot of ε versus the numerical solution for x. For that, we rely on the gnuplot application. As the Levque Cluster does not provide visualization services, the user should generate all graphical results in the form of output files. Thus, we create a gnuplot script that gathers the result files and writes the plot to an Encapsulated PostScript (EPS) file. Listing 7 shows this post-processing script. Notice that we also plot the theoretical solution x = 1 to compare the results.

Listing 7: Post-processing script for the bisection method application (bisection_ps.gnuplot)
#!/usr/bin/gnuplot

# make a plot with the results.
# for that, we pass a small script
# to gnuplot to get an eps file

set term postscript eps enhanced color;
set output "bisection_results.eps";
set xlabel "Tolerance of the Bisection Method (log scale)" font "Helvetica,20";
set ylabel "Solution found by the Method (x)" font "Helvetica,20";
set xrange [0.00001:.001] reverse;
set grid;
set logscale x;

set style line 1 lt 3 lw 3 pt 3;
set style line 2 lt 1 lw 3 pt 3;

plot 1 with lp title "Theoretical solution" ls 1, \
     "< cat bisection.o* | sort -n -k 1" using 1:2 with lp title "Numerical solution" ls 2;

The output of this script is presented in Figure 5. The theoretical solution x = 1 is illustrated as a reference by a segmented line (in blue), and the numerical solution of the method is depicted by the solid red line; for each solution, a star is plotted over the solid line. The x-axis is log-scaled to improve the readability for small values of ε. Note that the numerical approximation of f(x) = 0 is not good for ε > 0.0003 (on both sides of the root x = 1). However, for ε < 0.0003 (approximately) the quality of the solution is reasonable, and for ε < 0.00004 the method converges exactly to the theoretical solution (or at least the difference is negligible with respect to the numerical representation of the processor).

Figure 5: Numerical solution for f(x) = 0 using the bisection method for several tolerance values

5.2 A parallel application example


The aim of this section is to illustrate how the user should interact with the Levque Cluster when dealing with parallelism. For that, we choose the classical fork and join problem, which is well studied in operating system courses. This problem consists in forking the execution of an application and then waiting for all the child processes to join the main execution thread. When solving this problem in C, the fork() function is commonly used, but the user must use signals to implement the join of the children processes. In C++, the fork can be implemented by using threads (the pthread library, for example), and the join of processes is easier to implement. However, when solving this problem assuming that each child process runs on a different CPU (or even on different hosts), the use of MPI is required. There are many ways to implement the fork and join problem using MPI; here, we show one of the simplest ones.
To give a glimpse of MPI: an MPI application is executed on different CPUs (or nodes) at the same time. In contrast to forks or threads, the MPI processes are detached from the beginning (there is no parent process or explicit fork). Each process is identified by a Process ID, or PID (its rank), and the MPI library provides methods to communicate these processes in different ways. The communication between processes is classified into blocking and non-blocking communication, depending on whether the execution of a receiving process is blocked or not by the receive instruction. MPI also provides many communication paradigms, such as unicast (one to one), broadcast (one to many) and collective (many to one), among others.
For this example, we use blocking unicast communications to wait for the other processes to finish. In other words, we assign the role of master to a single process, and the rest are considered slaves. As the parallel application begins its execution in a detached state, the master process will wait for each slave process to communicate its ending, and each slave process will perform some calculation and then return its result before ending. So, for n processes, the master should receive n − 1 messages before continuing its execution.
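As a minimal sketch of this join (written with the same MPI C++ bindings used in the listings below, but not the actual MPIMasterProcess implementation of this manual), the master side of the scheme reduces to posting one blocking receive per slave:

// fragment of a possible master run() method: wait for every slave
int size = MPI::COMM_WORLD.Get_size();
for (int i = 1; i < size; i++) {
   int result;
   MPI::Status status;
   // blocking unicast receive from any slave, tag 0
   MPI::COMM_WORLD.Recv(&result, 1, MPI::INT, MPI::ANY_SOURCE, 0, status);
   std::cout << "slave " << status.Get_source()
             << " finished with result " << result << std::endl;
}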
Implementing this example in C++, we propose an object oriented design based on an abstract MPI process. As this class cannot be instantiated (because it is abstract), we use the Factory pattern [2] to create instances of an MPI process. This factory delivers two kinds of MPI processes: the master and the slave process. The Factory object has the responsibility of checking the PID of each process and delivering the correct MPI process instance (master for PID 0 and slave for PID > 0). The main execution routine only commands the MPIProcess instance to begin its execution, and by polymorphism the application runs the appropriate run method. Therefore, we have two derived classes: MPIMasterProcess and MPISlaveProcess. We use this trick to avoid implementing the logic of the application in the main execution routine; remember that the main routine is executed on each CPU (or host), so in this way we obtain cleaner, more understandable code instead of a large main routine implementing everything. Listing 8 shows this main routine.

Listing 8: Starting routine for an MPI application (main.cpp)


/*
 * C++ MPI Example of collective join
 * NLHPC - CMM
 * Juan-Carlos Maureira
 * Center for Mathematical Modeling
 * Universidad de Chile
 */

#include <cstdlib>
#include <iostream>
#include <iomanip>

using namespace std;

#include "mpi.h"
#include "MPIObjectFactory.h"

int main(int argc, char* argv[]) {
    cout << "NLHPC C++ MPI example 1" << endl;

    MPI::Init(argc, argv);

    MPIProcess* p = MPIObjectFactory::getMPIProcess();
    p->run();

    MPI::Finalize();

    return 0;
}

Notice that each process must initialize the MPI environment (MPI::Init)
and finalize it when exiting (MPI::Finalize). The code running on each CPU
or host is the same: get an MPIProcess instance and call its run method. The
MPIObjectFactory::getMPIProcess() method returns the appropriate MPIProcess
instance according to the rank (PID) of each process (master for rank 0 and slave
for rank > 0). As both classes derive from MPIProcess, polymorphism does “the
magic” of invoking the correct implementation of the run method. Before discussing
the implementation of each MPIProcess, Listing 9 shows the Factory pattern
implemented for the MPIProcess abstract class.

Listing 9: The MPI Object Factory (MPIObjectFactory.h)
#ifndef MPIOBJECTFACTORY
#define MPIOBJECTFACTORY

#include "mpi.h"
#include "MPIMasterProcess.h"
#include "MPISlaveProcess.h"

class MPIObjectFactory {
private:
    static MPIProcess* createMasterProcess() {
        return new MPIMasterProcess();
    }
    static MPIProcess* createSlaveProcess() {
        return new MPISlaveProcess();
    }
public:
    static MPIProcess* getMPIProcess() {
        int rank = MPI::COMM_WORLD.Get_rank();
        if (rank == 0) {
            return createMasterProcess();
        }
        return createSlaveProcess();
    }
};

#endif

Notice that the method getMPIProcess() has the intelligence to return the
appropriate MPIProcess instance according to the rank of the MPI process.
Another important remark is that the factory methods are static. More details
about why they must be static, and further examples, can be found in [2].
Now, we define the interface of the MPIProcess abstract class. This class
is abstract since it has a pure virtual method, declared without an
implementation (the “= 0” on line 18 of Listing 10). Notice also that this class
provides the MPI functionality to get the rank of the process, send or receive
data, and so on. All the implementation common to the master and slave processes
should live in this class; otherwise, the user would have to implement it separately
in each derived class, increasing the maintenance cost of the code. It is always
better to factor the common methods into an ancestor class, in this case MPIProcess.
As the run() method is pure virtual, the user is forced to implement it when deriving
from MPIProcess; otherwise, the build will fail (at compile time if the derived class
stays abstract and is instantiated, or at link time if run() is declared but never
defined). Listings 10 and 11 present the interface definition and the implementation
of the MPIProcess class.

Listing 10: Interface for the MPI Process base class (MPIProcess.h)
 1  #ifndef MPIPROCESS
 2  #define MPIPROCESS
 3
 4  class MPIProcess {
 5  protected:
 6      int rank;
 7  public:
 8      MPIProcess();
 9      ~MPIProcess();
10
11      int getRank();
12      int getProcessNumber();
13
14      void send(int dst, int data);
15
16      void recv(int* data);
17
18      virtual void run() = 0;
19
20  };
21
22  #endif

Listing 11: Implementation for the MPI Process base class (MPIProcess.cc)
#include "mpi.h"
#include "MPIProcess.h"

MPIProcess::MPIProcess() {
    this->rank = MPI::COMM_WORLD.Get_rank();
}

MPIProcess::~MPIProcess() {
}

int MPIProcess::getRank() {
    return this->rank;
}

int MPIProcess::getProcessNumber() {
    return MPI::COMM_WORLD.Get_size();
}

void MPIProcess::send(int dst, int data) {
    // blocking send of a single integer to the process with rank dst (tag 0)
    MPI::COMM_WORLD.Send(&data, 1, MPI::INT, dst, 0);
}

// TODO: use templates to receive all the primitive types of data
void MPIProcess::recv(int* data) {
    // blocking receive of a single integer from any process
    MPI::COMM_WORLD.Recv(data, 1, MPI::INT, MPI::ANY_SOURCE, MPI::ANY_TAG);
}

Now the foundations of our MPI application are in place. From here, the
user can implement any MPI process by deriving from this base object model. As
this example aims to solve the fork and join problem, we implement a master
and a slave process inheriting from this object model. The master process
waits for the n − 1 slave processes to finish and then returns execution to
the main thread (the main.cpp running on rank 0). Listings 12 and 13 present the
MPIMasterProcess class interface and implementation. Notice that the only method
implemented is run(), since everything else is inherited from the MPIProcess
base class.

Listing 12: Interface for the MPI Master Process (MPIMasterProcess.h)


#ifndef MPIMASTERPROCESS
#define MPIMASTERPROCESS

#include "MPIProcess.h"

class MPIMasterProcess : public MPIProcess {
public:
    MPIMasterProcess();
    virtual void run();
};

#endif

Listing 13: Implementation for the MPI Master Process (MPIMasterProcess.cc)


#include <cstdlib>
#include <iostream>
#include <ctime>
#include "MPIMasterProcess.h"

MPIMasterProcess::MPIMasterProcess() {
    std::cout << "MPI Master Process Constructor" << std::endl;
}

void MPIMasterProcess::run() {
    std::cout << "run master process" << std::endl;

    // waiting for slaves to finish
    int count = 0;
    while (count < this->getProcessNumber() - 1) {
        int r = 0;
        this->recv(&r);
        std::cout << "master process: some slave has exited" << std::endl;
        count++;
    }

    // clean everything and return
    return;
}

The logic of the master process is implemented in the run() method. Notice
that it waits, by means of the blocking recv() method, for the response of
a slave process, and then adds 1 to the overall count of finished processes
(the count variable). We say “blocking” since the implementation of recv()
uses the Recv method of MPI, which is blocking by default. When
the finished-process count reaches the total number of processes, not counting the
master, the master's run() method ends.
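As a side note, should the master need to know which slave has finished, the blocking receive could expose the MPI::Status object that the MPI C++ bindings can fill in on reception. The fragment below is a hypothetical variant of the master loop, not part of the example sources; the free function run_master_with_status is ours, used only to keep the sketch self-contained.

// Sketch only: a variant of the master loop that queries the source rank
// of each termination message through MPI::Status.
#include "mpi.h"
#include <iostream>

void run_master_with_status(int processNumber) {
    int count = 0;
    while (count < processNumber - 1) {
        int r = 0;
        MPI::Status status;
        // Blocking receive, as in the example, but keeping the status object
        // so the source rank of the message can be queried afterwards.
        MPI::COMM_WORLD.Recv(&r, 1, MPI::INT, MPI::ANY_SOURCE, MPI::ANY_TAG, status);
        std::cout << "master: slave " << status.Get_source()
                  << " has exited" << std::endl;
        count++;
    }
}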
For MPISlaveProcess, we follow the same idea, but we implement the
run() method according to the slave's role. Listings 14 and 15 show the interface
and the implementation of the MPISlaveProcess class.

Listing 14: Interface for the MPI Slave Process (MPISlaveProcess.h)


#ifndef MPISLAVEPROCESS
#define MPISLAVEPROCESS

#include "MPIProcess.h"

class MPISlaveProcess : public MPIProcess {
public:
    MPISlaveProcess();
    virtual void run();
};

#endif

Listing 15: Implementation for the MPI Slave Process (MPISlaveProcess.cc)
#include <cstdlib>
#include <iostream>
#include <ctime>
#include <unistd.h>   // for sleep()
#include "MPISlaveProcess.h"

MPISlaveProcess::MPISlaveProcess() {
    std::cout << "MPI Slave Process Constructor" << std::endl;
}

void MPISlaveProcess::run() {
    std::cout << "run slave process" << std::endl;

    // do something: seed the random generator with the rank and wait
    srand(this->getRank());

    int wt = (rand() % 10) + 1;
    std::cout << "process " << this->getRank() << " waiting for " << wt
              << " seconds" << std::endl;
    sleep(wt);

    std::cout << "process " << this->getRank()
              << " notifying master and exiting" << std::endl;

    this->send(0, 1);

    // clean everything and return
    return;
}

For this example, the run() method waits a random time before notifying the
master of its termination and returning. Note that we seed the random number
generator with the rank of the process in order to get a different stream of random
numbers in each process; otherwise, all of them would generate the same stream,
which is not useful for our purpose. Also note that the method uses send() to inform
the master (rank 0) that the slave process has ended. For that, we send the number 1
to the master. Why 1? It does not matter; it can be any integer. To send a float or
any other primitive type of data, MPIProcess would have to implement send and receive
methods for those types; only the integer version is implemented here.
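As the TODO comment in Listing 11 suggests, one possible way to support other primitive types is a small template that maps each C++ type to its MPI datatype. The following is only a sketch of that idea; the trait MPITypeOf and the free functions sendValue/recvValue are hypothetical names, not part of the example sources.

// Sketch only: hypothetical templated helpers mapping C++ primitive types
// to MPI datatypes so that any of them can be sent or received.
#include "mpi.h"

template <typename T> struct MPITypeOf;   // no generic case: only listed types compile
template <> struct MPITypeOf<int>    { static const MPI::Datatype& get() { return MPI::INT; } };
template <> struct MPITypeOf<float>  { static const MPI::Datatype& get() { return MPI::FLOAT; } };
template <> struct MPITypeOf<double> { static const MPI::Datatype& get() { return MPI::DOUBLE; } };

template <typename T>
void sendValue(int dst, const T& data) {
    // blocking send of a single value to rank dst (tag 0, as in the integer version)
    MPI::COMM_WORLD.Send(&data, 1, MPITypeOf<T>::get(), dst, 0);
}

template <typename T>
void recvValue(T* data) {
    // blocking receive of a single value from any process
    MPI::COMM_WORLD.Recv(data, 1, MPITypeOf<T>::get(), MPI::ANY_SOURCE, MPI::ANY_TAG);
}

With helpers of this kind, a slave could, for instance, report a double result with sendValue(0, result) and the master could receive it with recvValue(&result); integrating them as members of MPIProcess would follow the same pattern as the existing integer-only send() and recv().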
Now the code is complete and we need to compile it. We use a “Makefile” to
perform this task, in a similar way to the previous example. However, we use
the mpic++ compiler wrapper to avoid adding the MPI library and include
paths by hand. This wrapper, commonly provided by MPI implementations, may differ
in name, so the user must be careful to invoke the correct compilation
wrapper. For this example, we use the OpenMPI implementation, so we load
the corresponding module before compiling the code, as mentioned in Section
4.6. Listing 16 shows what this Makefile looks like.

Listing 16: Makefile to compile the MPI example application (Makefile)


CXX = mpic++
TARGET = mpi_example_1
SRC = main.cpp MPIProcess.cc MPIMasterProcess.cc MPISlaveProcess.cc
OBJECTS = $(addsuffix .o, $(basename $(SRC)))

all: main

%.o: %.cc
	$(CXX) -c $<

.cpp.o:
	$(CXX) -c $<

main: $(OBJECTS)
	$(CXX) *.o -o $(TARGET)

clean:
	rm *.o
	rm $(TARGET)

The compilation is started by issuing the command “make” within
the directory where the Makefile and the source files are. We do not discuss the
possible compilation errors the user might face, since this document is meant to
illustrate how to build applications, not how to debug their compilation; we assume
the user manages the compilation process and is able to complete it successfully.
The generated binary is called mpi_example_1 and, to run it, we invoke it
through the mpirun execution wrapper. This wrapper takes as an argument
the number of cores on which the application will be executed and, once the run
finishes, the stdout of every process is combined into the stdout of the mpirun
process. In the following, we show the output of the application when executing it
on the Levque Cluster.
[jcm@development ~/examples]$ mpirun -n 5 ./mpi_example_1
NLHPC C++ MPI example 1
NLHPC C++ MPI example 1
NLHPC C++ MPI example 1
NLHPC C++ MPI example 1
NLHPC C++ MPI example 1
MPI Master Process Constructor
run master process
MPI Slave Process Constructor
run slave process
process 1 waiting for 4 seconds
MPI Slave Process Constructor
run slave process
process 2 waiting for 1 seconds
MPI Slave Process Constructor
run slave process
process 4 waiting for 2 seconds
MPI Slave Process Constructor
run slave process
process 3 waiting for 7 seconds
master process: some slave has exited
process 2 notifying master and exiting
master process: some slave has exited
process 4 notifying master and exiting
process 1 notifying master and exiting
master process: some slave has exited
process 3 notifying master and exiting
master process: some slave has exited
[jcm@development ~/examples]$

Notice that the banner text is shown 5 times, indicating that the main routine is
executed on each one of the 5 requested CPUs. After each MPI process is
created, our application assigns the master role to the process with rank 0 and the
slave role to the rest of the processes (rank > 0). Then, the run method is invoked.
The master waits for the slave processes, and each slave waits a random time before
notifying the master of its termination. When the last slave process has
ended, the master process finishes its run method, returning execution control
to the main routine.
When running this example as a job, we require an MPI job submission script.
As mentioned in Section 4.3, a parallel job script must request a parallel
environment to use. As we compiled our application using the OpenMPI
library, we must run it under the corresponding parallel environment. Therefore,
the job script looks like the one shown in Listing 17.

Listing 17: Job submission script for the MPI example application


#!/bin/bash
#$ -cwd
#$ -j n
#$ -notify
#$ -M jcm@dim.uchile.cl
#$ -m abes
#$ -N mpi_example_1
#$ -S /bin/bash
#$ -pe openmpi 5
#$ -q all.q

# load the correct environment to run
source /etc/profile.d/modules.sh
module load openmpi_gcc44

# run the mpi application
mpirun ./mpi_example_1

The submission of this script is described in Section 4.4, and the results are
recovered in the same way as in the previous example (through the output file
generated by the workload manager).

5.3 Final words


In this section we presented two example applications. The objective is to
illustrate to the user how to interact with the Levque Cluster when programming
scientific applications. We must stress, however, that there are many languages and
styles for programming such applications; there is no single “best” way to solve
a problem. Depending on the type of problem and the scope of the development,
some languages and techniques are more suitable than others.
Here we used C++ to illustrate our examples and, in particular, an object-oriented
approach to the process of building an application.
We invite the user to explore the aspects of object-oriented
programming more deeply in [7] and [8]. The latter is devoted to design patterns that
improve the robustness and maintainability of applications, features that are more
than desirable when creating applications.

References
[1] Bash Linux commands. http://ss64.com/bash/.

[2] Factory pattern. http://en.wikibooks.org/wiki/C++_Programming/
Code/Design_Patterns.
[3] GNU Make: an introduction to Makefiles. http://www.apl.jhu.edu/Misc/
Unix-info/make/make_2.html.

[4] Lustre filesystem. http://www.lustre.org.


[5] Open Message Passing Interface (OpenMPI) documentation. http://www.
open-mpi.org/doc/.
[6] Oracle Grid Engine user manual (formerly called Sun Grid Engine).
http://download.oracle.com/docs/cd/E19080-01/n1.grid.eng6/
817-6117/index.html.
[7] Bruce Eckel. Thinking in C++. Prentice-Hall, Inc., Upper Saddle River,
NJ, USA, 1995.
[8] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design
patterns: elements of reusable object-oriented software. Addison-Wesley Pro-
fessional, 1995.

