IN
HIGH PERFORMANCE COMPUTING (HPC) CLUSTERS
By
I. SAI CHARAN
2017506588
B. Tech – Computer Science and Engineering
AT
ANNA UNIVERSITY
May-June, 2019
ACKNOWLEDGEMENTS
I take this opportunity to thank Smt. Uma Makeshwari, Head, CSC, Sastra Deemed
to be University, Thanjavur, for granting me permission to carry out this project in
IGCAR, Kalpakkam.
I would like to express my sincere thanks to my guide, Dr. M. L. Jayalal, Head,
Computing System Section (CSS), Computer Division (CS), for sharing his expertise
and invaluable time to carry out this project.
I am grateful to Smt. Deepika Vinod, who provided the technical support for my
project work. I sincerely acknowledge her expert guidance, without which I could
not have completed this project.
I am extremely thankful to Dr. A. K. Bhaduri, Director, Indira Gandhi Centre for
Atomic Research (IGCAR), for granting me permission and providing the facilities to
carry out the project work in the Computer Division.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
1. Introduction
2. Linux Programming
3. Parallel Programming
3.5.1 Introduction
3.5.7 Reductions
3.5.8 Scheduling
3.6.1 MPI Routines
4. Parallel Programs
5. Conclusion
6. References
ABSTRACT
The rapid growth in the number and complexity of computational problems creates a
need for ever greater computing capability. Advances in processors and other
hardware can only take us so far; the next logical step is to move from
single-processor devices to multi-core and multi-node technologies.
The aim of this project is to parallelize a few problems that are traditionally solved
by serial (sequential) programs and to execute them on HPC clusters. To this end, a
few parallel programs were studied using the MPI and OpenMP libraries, and a
performance comparison was carried out.
1. INTRODUCTION
Indira Gandhi Centre for Atomic Research (IGCAR), Kalpakkam is one of India’s
premier atomic research centres. The Reactor Research Centre (RRC) was established
in 1971 and was renamed IGCAR by the then Prime Minister of India, Rajiv Gandhi,
in December 1985. IGCAR is engaged in a broad-based, multidisciplinary programme
of scientific research and advanced engineering directed towards the development of
Fast Breeder Reactor technology in India.
It is the second largest establishment of the Department of Atomic Energy (DAE),
next to Bhabha Atomic Research Centre (BARC), Mumbai. The centre houses the Fast
Breeder Test Reactor (FBTR) and various scientific and engineering establishments.
The campus also has a reprocessing plant, the Kalpakkam Reprocessing Plant (KARP),
which reprocesses the spent uranium fuel from the reactor to retrieve plutonium.
HPC clusters are employed in several fields of research, such as:
Aeronautical Sciences
Drug Testing
Material Sciences
Supercomputing Clusters:
IVY Cluster:
◦ 51 TB memory
◦ 200 TFLOPS
NEHA Cluster:
◦ 134 compute nodes
◦ 16.7 TFLOPS
XEON Cluster:
◦ 1 TB of distributed memory
◦ 9.5 TFLOPS
2. LINUX PROGRAMMING
2.1 Intro to Linux
Linux is a family of open-source, Unix-like operating systems based on the Linux
kernel. Linux runs on systems ranging from personal computers to servers,
mainframes and supercomputing clusters, and is widely used in enterprises and
universities.
The experiments of this project were carried out in a Linux environment. Some of
the key Linux commands used are listed below. These commands helped navigate
through directories and view, create and modify files.
… remove directories.
10. echo - echo <specify the string to be printed> - Display a line of text.
3. PARALLEL PROGRAMMING
Example:
In the example shown in Figure 2, do_payroll() is a function that is executed serially.
It consists of several internal calculations, shown as individual blocks. Each stage of
the calculation is performed one after the other, at successive times tn.
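As a concrete illustration (a minimal sketch, not code from this report), the program below assumes a hypothetical compute_payroll() helper and processes the employees strictly one after the other, as in the serial example of Figure 2:

#include <stdio.h>

#define NUM_EMPLOYEES 4

/* Hypothetical per-employee payroll calculation (placeholder logic). */
static double compute_payroll(int employee)
{
    return 1000.0 + 50.0 * employee;
}

int main(void)
{
    double payroll[NUM_EMPLOYEES];

    /* Serial execution: each stage runs only after the previous one finishes. */
    for (int i = 0; i < NUM_EMPLOYEES; i++)
        payroll[i] = compute_payroll(i);

    for (int i = 0; i < NUM_EMPLOYEES; i++)
        printf("Employee %d payroll: %.2f\n", i, payroll[i]);
    return 0;
}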
3.2 Parallel Computing
Parallel Computing refers to the simultaneous use of multiple compute resources to
solve a computational problem. A problem is broken down into discrete parts that
can be solved concurrently; each part is further broken down into a series of
instructions that execute simultaneously on different processors, as shown in Figure 3.
Example:
In the example depicted in Figure 3, the same function do_payroll() seen in Figure 2
is performed in a parallel fashion. The function is split into 4 parts, each part
responsible for computing the payroll of a particular employee. All four segments
are executed simultaneously by different processors, and the end results from the
processors are later combined.
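A minimal sketch of this idea (again assuming the hypothetical compute_payroll() helper, and using OpenMP, which is introduced in Section 3.5) distributes the employees across threads:

#include <stdio.h>
#include <omp.h>

#define NUM_EMPLOYEES 4

/* Same hypothetical per-employee calculation as in the serial sketch. */
static double compute_payroll(int employee)
{
    return 1000.0 + 50.0 * employee;
}

int main(void)
{
    double payroll[NUM_EMPLOYEES];

    /* Each iteration (one employee) is handled by a different thread. */
    #pragma omp parallel for
    for (int i = 0; i < NUM_EMPLOYEES; i++)
        payroll[i] = compute_payroll(i);

    /* The results are combined after the parallel region. */
    for (int i = 0; i < NUM_EMPLOYEES; i++)
        printf("Employee %d payroll: %.2f\n", i, payroll[i]);
    return 0;
}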
3.3 Two Architectures of Parallel Computing:
Shared Memory Architecture:
• Multiple cores
• Share a global memory space

Distributed Memory Architecture:
• Collection of nodes which have multiple cores
• Each node uses its own local memory
• Nodes work together to solve a problem
3.5 Open Multi-Processing (OpenMP):
3.5.1 Introduction:
OpenMP is designed for multi-processor, shared-memory machines, which may have
either of two architectures: Uniform Memory Access (UMA) or Non-Uniform Memory
Access (NUMA).
Uniform Memory Access (UMA): UMA is a shared memory architecture in which all
the processors share the physical memory uniformly. In UMA, the access time to a
memory location is independent of which processor makes the request or which
memory chip contains the data.
FIGURE 6: NON-UNIFORM MEMORY ACCESS
3.5.5 OpenMP API Overview:
The OpenMP API comprises three distinct components:
Compiler Directives: a set of 19 compiler directives for converting a serial program
into a parallel one.
Runtime Library Routines: 32 runtime library routines that provide information
about, and control over, the running threads.
Environment Variables: 9 environment variables that set the conditions under which
the program executes.
… parallel region.
Example:
#include <omp.h>
int omp_get_num_threads(void);
Example:
export OMP_NUM_THREADS=8
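These pieces can be combined into a complete, minimal program (a sketch for illustration, not code from this project) that uses a compiler directive, the runtime routines omp_get_thread_num() and omp_get_num_threads(), and the OMP_NUM_THREADS environment variable:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Compiler directive: create a team of threads. */
    #pragma omp parallel
    {
        /* Runtime library routines: query the team. */
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Thread %d of %d\n", tid, nthreads);
    }
    return 0;
}

Compiled with gcc -fopenmp and run after "export OMP_NUM_THREADS=8", it prints one line from each of the eight threads.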
3.5.7 Reductions:
A reduction automatically privatizes the specified variables and initialises each
private instance with a sensible starting value. At the end of the construct, all
partial results are accumulated into the shared instance to give the final result.
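A minimal sketch of a reduction (illustrative only; the array contents and size are arbitrary) sums an array in parallel:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void)
{
    double a[N], sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    /* 'sum' is privatized in each thread and initialised to 0; the partial
       sums are accumulated into the shared 'sum' when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}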
3.5.8 Scheduling:
OpenMP distributes the iterations of a loop among the threads available to the
program. This scheduling behaviour can be modified as the user requires (a usage
sketch is given after Figure 8). There are five scheduling types, namely:
static
dynamic
guided
auto
runtime
FIGURE 8: DIFFERENT METHODS OF SCHEDULING IN OPENMP
The example loop has 20 iterations and is executed by three threads (T1, T2
and T3).
Note that only the STATIC schedules guarantee that the distribution of chunks
among threads stays the same from run to run.
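A minimal usage sketch of the schedule clause (illustrative only; the 20-iteration loop and chunk size of 2 simply mirror the figure) is:

#include <stdio.h>
#include <omp.h>

#define N 20

int main(void)
{
    /* Dynamic scheduling: each thread takes the next chunk of 2 iterations
       as soon as it finishes its previous chunk. Replacing "dynamic, 2" with
       "static" would give every thread a fixed block of iterations instead. */
    #pragma omp parallel for schedule(dynamic, 2)
    for (int i = 0; i < N; i++)
        printf("Iteration %2d executed by thread %d\n", i, omp_get_thread_num());
    return 0;
}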
3.6 Message Passing Interface (MPI):
Message Passing Interface (MPI) is a standardized means of exchanging messages
between multiple computers running a parallel program across distributed memory.
MPI defines a standard suite of functions for exchanging data between nodes and
for controlling the parallel cluster.
MPI programs are made up of communicating processes, each with its own set of
variables. Originally, MPI was designed for distributed memory architectures, as
shown in Figure 8.
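The typical structure of an MPI program (a generic sketch, not code from this project) is to initialise the environment, query the process rank and the total number of processes, do the work, and finalise:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* start the MPI environment     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's id (rank)      */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes     */

    printf("Process %d of %d\n", rank, size);

    MPI_Finalize();                       /* shut the MPI environment down */
    return 0;
}

Launched with, for example, "mpirun -np 4 ./a.out", this starts four communicating processes.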
3.6.1 MPI Routines:
Blocking and Non-Blocking:
Most of the MPI point-to-point routines can be used in either blocking or
non-blocking mode.
Blocking: A blocking send routine will only "return" after it is safe to modify
the sent data for reuse. Safe means that modifications will not affect the data
intended for the receive task.
Non-Blocking: Non-blocking send and receive routines return almost immediately;
they do not wait for any communication events to complete. A non-blocking
operation is completed later by a call such as MPI_Wait(...), after which the buffer
can safely be reused.
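A minimal sketch of non-blocking communication (illustrative only; it assumes the program is run with at least two processes) pairs MPI_Isend and MPI_Irecv with MPI_Wait:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        value = 42;
        /* Returns immediately; 'value' must not be modified until MPI_Wait confirms completion. */
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
    }
    else if (rank == 1)
    {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
        printf("Process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}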
MPI_Send: Send a point-to-point message to the destination process in the
communicator group.
4. PARALLEL PROGRAMS
In matrix multiplication, the elements of each row of the first matrix are multiplied
with the elements of each column of the second matrix, and their products are
added. This forms the required resultant matrix.
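In symbols, with m1 and m2 the input matrices of order n and m3 the result (the same names used in the code below), each element is

m3_{ij} = \sum_{k=0}^{n-1} m1_{ik} \, m2_{kj}

and these n x n dot products are independent of one another.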
Since these operations are independent of each other, they can be parallelized to
run on multiple processors. We split the rows of the first matrix among the different
processors, and the second matrix is broadcast to all of them.
The program was implemented using both MPI and OpenMP, and the execution times
were compared. Measurements were recorded multiple times while increasing the
order of the matrices.
/* Matrix multiplication kernel, m3 = m1 x m2, with timing.
   (The header of the outer loop over i is reconstructed from context.) */
for (i = 0; i < n; i++)
{
    for (j = 0; j < n; j++)
    {
        m3[i][j] = 0;
        for (k = 0; k < n; k++)
        {
            m3[i][j] = m3[i][j] + m1[i][k] * m2[k][j];
        }
        printf("The value m3[%d][%d] is %f\n", i, j, m3[i][j]);
    }
}
endTime = clock();
execTime = endTime - startTime;
printf("%.15f \n", execTime / CLOCKS_PER_SEC);
return(0);
}
/* OpenMP work-sharing directive and outer loops reconstructed from context:
   each iteration of i (one row of m3) is handled by a particular thread. */
#pragma omp parallel for private(j, k)
for (i = 0; i < n; i++)
{
    for (j = 0; j < n; j++)
    {
        m3[i][j] = 0;
        for (k = 0; k < n; k++)
        {
            m3[i][j] = m3[i][j] + m1[i][k] * m2[k][j];
        }
        printf("The value m3[%d][%d] is %f\n", i, j, m3[i][j]);
    }
}
An OpenMP code for Matrix Multiplication was written as shown above and
executed for [500 x 500] matrices. The program was written such that each iteration
of the main loop would be handled by a particular thread. This enabled the
simultaneous processing of all the iterations of the loop.
The program was tested with different numbers of threads, and the execution time
was noted in each case. The command "export OMP_NUM_THREADS=N" allowed us
to specify exactly how many threads were to be used for a particular execution test.
The execution times obtained for the matrix multiplication are as shown below:
GRAPH 1: EXECUTION TIME OF 500X500 MATRIX IN OPENMP
Master Node:
/* Send each worker its share of rows of the first matrix and the whole second matrix. */
MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
MPI_Send(&a[offset][0], rows*NCA, MPI_LONG_DOUBLE, dest, mtype,
MPI_COMM_WORLD);
MPI_Send(&b, NCA*NCB, MPI_LONG_DOUBLE, dest, mtype, MPI_COMM_WORLD);
offset = offset + rows;
}
/* Receive results from worker tasks */
mtype = FROM_WORKER;
for (i=1; i<=numworkers; i++)
{
source = i;
MPI_Recv (&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
MPI_Recv (&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
MPI_Recv (&c[offset][0], rows*NCB, MPI_LONG_DOUBLE, source, mtype, MPI_COMM_WORLD,
&status);
}
Worker Node:
if (taskid> MASTER)
{
mtype = FROM_MASTER;
MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&a, rows*NCA, MPI_LONG_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&b, NCA*NCB, MPI_LONG_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
for (k=0; k<NCB; k++)
for (i=0; i<rows; i++)
{
c[i][k] = 0.0;
for(j=0; j<NCA; j++)
c[i][k] = c[i][k] + a[i][j] * b[j][k];
}
mtype = FROM_WORKER;
MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&c, rows*NCB, MPI_LONG_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}
MPI_Finalize();
The MPI code is split into the Master Node segment and the Worker Node segments.
The Master Node splits and sends the rows of the first matrix to the respective
worker processes, collects the result computed by each worker, and combines them
to form the final resultant matrix. The second matrix is broadcast to all Worker
Nodes.
The Worker Nodes receive their respective rows from the Master Node and carry out
their computations. Once a node has completed its calculation, its result is sent back
to the Master Node. The Master Node finally combines all the results in order and
displays the final output.
The observed execution times for [100 x 100] and [500 x 500] matrices are as follows:
GRAPH 2: EXECUTION TIME OF 100X100 MATRIX IN MPI
It is seen from the graph that the execution time falls steadily in the region of 5 to
20 cores, so this range gives the optimum number of cores for this task. The spike in
the graph when the number of cores exceeds 20 indicates that the overhead of
organizing the cores has exceeded the computational work of the program.
5. CONCLUSION
It was found that running a parallel program on multiple cores was beneficial only
as long as the time required to initialise the processors remained small compared to
the actual execution time of the program.
6. REFERENCES
1. https://computing.llnl.gov/tutorials/mpi
2. https://computing.llnl.gov/tutorials/openMP
3. https://en.wikipedia.org/wiki/Message_Passing_Interface
4. https://en.wikipedia.org/wiki/OpenMP
5. Books: Parallel Programming - Techniques; Parallel Programming with MPI
6. https://igcar.gov.in
7. https://en.wikipedia.org/wiki/Linux