IN
HIGH PERFORMANCE COMPUTING (HPC) CLUSTERS
By
I. SAI CHARAN
2017506588
B. Tech – Computer Science and Engineering
AT
ANNA UNIVERSITY
May-June, 2019
ACKNOWLEDGEMENTS
I take this opportunity to thank Smt. Uma Makeshwari, Head, CSC, Sastra Deemed
to be University, Thanjavur, for granting me permission to carry out this project in
IGCAR, Kalpakkam.
I would like to express my sincere thanks to my guide, Dr. M. L. Jayalal, Head,
Computing System Section (CSS), Computer Division (CS), for sharing his expertise
and invaluable time to carry out this project.
I am grateful to Smt. Deepika Vinod, who provided the technical support for my
project work. I sincerely acknowledge her expert guidance, without which I could
not have completed this project.
I am extremely thankful to Dr. A. K. Bhaduri, Director, Indira Gandhi Centre for
Atomic Research (IGCAR), for granting me permission and providing the facilities to
carry out the project work in the Computer Division.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
1. Introduction
2. Linux Programming
3. Parallel Programming
3.5.1 Introduction
3.5.7 Reductions
3.5.8 Scheduling
3.6.1 MPI Routines
4. Parallel Programs
5. Conclusion
6. References
ABSTRACT
The rapid growth in the number and complexity of computational problems creates a
need for ever greater computing capability. Advances in processors and other
hardware can only take us so far; the next logical step is to move from
single-processor devices to multi-core and multi-node technologies.
The aim of this project is to parallelize a few problems that are traditionally solved
by serial (sequential) programs and to execute them on HPC clusters. To this end, a
few parallel programs were studied using the MPI and OpenMP libraries, and a
performance comparison was carried out.
1. INTRODUCTION
Indira Gandhi Centre for Atomic Research (IGCAR), Kalpakkam is one of India’s
premier atomic research centres. The Reactor Research Centre (RRC) was established
in 1971 and was renamed IGCAR by the then Prime Minister of India, Rajiv Gandhi,
in December 1985. IGCAR is engaged in a broad-based, multidisciplinary programme
of scientific research and advanced engineering directed towards the development of
Fast Breeder Reactor technology in India.
It is the second largest establishment of the Department of Atomic Energy (DAE),
next to Bhabha Atomic Research Centre (BARC), Mumbai. The centre houses the Fast
Breeder Test Reactor (FBTR) and various scientific and engineering establishments.
The campus also has a reprocessing plant, the Kalpakkam Reprocessing Plant (KARP),
which reprocesses the spent uranium fuel from the reactor to retrieve plutonium.
HPC clusters are employed in several fields of research, such as:
Aeronautical Sciences
Drug Testing
Material Sciences
Supercomputing Clusters:
IVY Cluster:
◦ 51 TB memory
◦ 200 TFLOPS
NEHA Cluster:
◦ 134 compute nodes
◦ 16.7 TFLOPS
XEON Cluster:
◦ 1 TB of distributed memory
◦ 9.5 TFLOPS
2. LINUX PROGRAMMING
2.1 Intro to Linux
Linux is a family of open-source, Unix-like operating systems based on the Linux
kernel. Linux runs on systems ranging from personal computers to servers,
mainframes and supercomputing clusters, and is widely used in enterprises and
universities.
The experiments of this project were carried out in a Linux environment. Some of
the key Linux commands used are listed below. These commands helped navigate
through directories and view, create and modify files.
… remove directories.
10. echo - echo <specify the string to be printed> - Display a line of text.
3. PARALLEL PROGRAMMING
Example:
In the example shown in Figure 2, do_payroll() is a function that is executed serially.
It consists of several internal calculations, shown as individual blocks. Each stage of
the calculation is performed one after the other, at successive times tn.
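As a concrete illustration (a minimal sketch, not code from this report), the program below assumes a hypothetical compute_payroll() helper and processes the employees strictly one after the other, as in the serial example of Figure 2:

#include <stdio.h>

#define NUM_EMPLOYEES 4

/* Hypothetical per-employee payroll calculation (placeholder logic). */
static double compute_payroll(int employee)
{
    return 1000.0 + 50.0 * employee;
}

int main(void)
{
    double payroll[NUM_EMPLOYEES];

    /* Serial execution: each stage runs only after the previous one finishes. */
    for (int i = 0; i < NUM_EMPLOYEES; i++)
        payroll[i] = compute_payroll(i);

    for (int i = 0; i < NUM_EMPLOYEES; i++)
        printf("Employee %d payroll: %.2f\n", i, payroll[i]);
    return 0;
}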
3.2 Parallel Computing
Parallel Computing refers to the simultaneous use of multiple compute resources to
solve a computational problem. A problem is broken down into discrete parts that
can be solved concurrently; each part is further broken down into a series of
instructions that execute simultaneously on different processors, as shown in Figure 3.
Example:
In the example depicted in Figure 3, the same function do_payroll() seen in Figure 2
is performed in a parallel fashion. The function is split into 4 parts, each part
responsible for computing the payroll of a particular employee. All four segments
are executed simultaneously by different processors, and the end results from the
processors are later combined.
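A minimal sketch of this idea (again assuming the hypothetical compute_payroll() helper, and using OpenMP, which is introduced in Section 3.5) distributes the employees across threads:

#include <stdio.h>
#include <omp.h>

#define NUM_EMPLOYEES 4

/* Same hypothetical per-employee calculation as in the serial sketch. */
static double compute_payroll(int employee)
{
    return 1000.0 + 50.0 * employee;
}

int main(void)
{
    double payroll[NUM_EMPLOYEES];

    /* Each iteration (one employee) is handled by a different thread. */
    #pragma omp parallel for
    for (int i = 0; i < NUM_EMPLOYEES; i++)
        payroll[i] = compute_payroll(i);

    /* The results are combined after the parallel region. */
    for (int i = 0; i < NUM_EMPLOYEES; i++)
        printf("Employee %d payroll: %.2f\n", i, payroll[i]);
    return 0;
}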
3.3 Two Architectures of Parallel Computing:
Shared Memory Architecture:
• Multiple cores
• Share a global memory space

Distributed Memory Architecture:
• Collection of nodes which have multiple cores
• Each node uses its own local memory
• Nodes work together to solve a problem
3.5 Open Multi-Processing (OpenMP):
3.5.1 Introduction:
OpenMP is designed for multi-processor, shared-memory machines, which may have
either of two architectures: Uniform Memory Access (UMA) or Non-Uniform Memory
Access (NUMA).
Uniform Memory Access (UMA): UMA is a shared memory architecture in which all
the processors share the physical memory uniformly. In UMA, the access time to a
memory location is independent of which processor makes the request or which
memory chip contains the data.
FIGURE 6: NON-UNIFORM MEMORY ACCESS
3.5.5 OpenMP API Overview:
The OpenMP API comprises three distinct components:
Compiler Directives: a set of 19 compiler directives for converting a serial program
into a parallel one.
Runtime Library Routines: 32 runtime library routines that provide information
about, and control over, the running threads.
Environment Variables: 9 environment variables that set the conditions under which
the program executes.
… parallel region.
Example:
#include <omp.h>
int omp_get_num_threads(void);
Example:
export OMP_NUM_THREADS=8
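These pieces can be combined into a complete, minimal program (a sketch for illustration, not code from this project) that uses a compiler directive, the runtime routines omp_get_thread_num() and omp_get_num_threads(), and the OMP_NUM_THREADS environment variable:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Compiler directive: create a team of threads. */
    #pragma omp parallel
    {
        /* Runtime library routines: query the team. */
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Thread %d of %d\n", tid, nthreads);
    }
    return 0;
}

Compiled with gcc -fopenmp and run after "export OMP_NUM_THREADS=8", it prints one line from each of the eight threads.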
3.5.7 Reductions:
A reduction automatically privatizes the specified variables and initialises each
private instance with a sensible starting value. At the end of the construct, all
partial results are accumulated into the shared instance to give the final result.
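A minimal sketch of a reduction (illustrative only; the array contents and size are arbitrary) sums an array in parallel:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void)
{
    double a[N], sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    /* 'sum' is privatized in each thread and initialised to 0; the partial
       sums are accumulated into the shared 'sum' when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}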
3.5.8 Scheduling:
OpenMP distributes the iterations of a loop among the threads available to the
program. This scheduling behaviour can be modified as the user requires (a usage
sketch is given after Figure 8). There are five scheduling types, namely:
static
dynamic
guided
auto
runtime
FIGURE 8: DIFFERENT METHODS OF SCHEDULING IN OPENMP
The example loop has 20 iterations and is executed by three threads (T1, T2
and T3).
Note that only the STATIC schedules guarantee that the distribution of chunks
among threads stays the same from run to run.
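A minimal usage sketch of the schedule clause (illustrative only; the 20-iteration loop and chunk size of 2 simply mirror the figure) is:

#include <stdio.h>
#include <omp.h>

#define N 20

int main(void)
{
    /* Dynamic scheduling: each thread takes the next chunk of 2 iterations
       as soon as it finishes its previous chunk. Replacing "dynamic, 2" with
       "static" would give every thread a fixed block of iterations instead. */
    #pragma omp parallel for schedule(dynamic, 2)
    for (int i = 0; i < N; i++)
        printf("Iteration %2d executed by thread %d\n", i, omp_get_thread_num());
    return 0;
}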
3.6 Message Passing Interface (MPI):
Message Passing Interface (MPI) is a standardized means of exchanging messages
between multiple computers running a parallel program across distributed memory.
MPI defines a standard suite of functions for exchanging data between nodes and
for controlling the parallel cluster.
MPI programs are made up of communicating processes, each with its own set of
variables. Originally, MPI was designed for distributed memory architectures, as
shown in Figure 8.
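The typical structure of an MPI program (a generic sketch, not code from this project) is to initialise the environment, query the process rank and the total number of processes, do the work, and finalise:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* start the MPI environment     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's id (rank)      */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes     */

    printf("Process %d of %d\n", rank, size);

    MPI_Finalize();                       /* shut the MPI environment down */
    return 0;
}

Launched with, for example, "mpirun -np 4 ./a.out", this starts four communicating processes.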
3.6.1 MPI Routines:
Blocking and Non-Blocking:
Most of the MPI point-to-point routines can be used in either blocking or
non-blocking mode.
Blocking: A blocking send routine will only "return" after it is safe to modify
the sent data for reuse. Safe means that modifications will not affect the data
intended for the receive task.
Non-Blocking: Non-blocking send and receive routines return almost immediately;
they do not wait for any communication events to complete. A non-blocking
operation is completed later by a call such as MPI_Wait(...), after which the buffer
can safely be reused.
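A minimal sketch of non-blocking communication (illustrative only; it assumes the program is run with at least two processes) pairs MPI_Isend and MPI_Irecv with MPI_Wait:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        value = 42;
        /* Returns immediately; 'value' must not be modified until MPI_Wait confirms completion. */
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
    }
    else if (rank == 1)
    {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
        printf("Process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}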
MPI_Send: Send a point-to-point message to the destination process in the
communicator group.
4. PARALLEL PROGRAMS
In matrix multiplication, the elements of each row of the first matrix are multiplied
with the elements of each column of the second matrix, and their products are
added. This forms the required resultant matrix.
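In symbols, with m1 and m2 the input matrices of order n and m3 the result (the same names used in the code below), each element is

m3_{ij} = \sum_{k=0}^{n-1} m1_{ik} \, m2_{kj}

and these n x n dot products are independent of one another.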
Since these operations are independent of each other, they can be parallelized to
run on multiple processors. We split the rows of the first matrix among the different
processors, and the second matrix is broadcast to all of them.
The program was implemented using both MPI and OpenMP, and the execution times
were compared. Measurements were recorded multiple times while increasing the
order of the matrices.
/* Matrix multiplication kernel, m3 = m1 x m2, with timing.
   (The header of the outer loop over i is reconstructed from context.) */
for (i = 0; i < n; i++)
{
    for (j = 0; j < n; j++)
    {
        m3[i][j] = 0;
        for (k = 0; k < n; k++)
        {
            m3[i][j] = m3[i][j] + m1[i][k] * m2[k][j];
        }
        printf("The value m3[%d][%d] is %f\n", i, j, m3[i][j]);
    }
}
endTime = clock();
execTime = endTime - startTime;
printf("%.15f \n", execTime / CLOCKS_PER_SEC);
return(0);
}
/* OpenMP work-sharing directive and outer loops reconstructed from context:
   each iteration of i (one row of m3) is handled by a particular thread. */
#pragma omp parallel for private(j, k)
for (i = 0; i < n; i++)
{
    for (j = 0; j < n; j++)
    {
        m3[i][j] = 0;
        for (k = 0; k < n; k++)
        {
            m3[i][j] = m3[i][j] + m1[i][k] * m2[k][j];
        }
        printf("The value m3[%d][%d] is %f\n", i, j, m3[i][j]);
    }
}
An OpenMP code for Matrix Multiplication was written as shown above and
executed for [500 x 500] matrices. The program was written such that each iteration
of the main loop would be handled by a particular thread. This enabled the
simultaneous processing of all the iterations of the loop.
The program was tested with different numbers of threads, and the execution time
was noted in each case. The command "export OMP_NUM_THREADS=N" allowed us
to specify exactly how many threads were to be used for a particular execution test.
The execution times obtained for the matrix multiplication are as shown below:
GRAPH 1: EXECUTION TIME OF 500X500 MATRIX IN OPENMP
Master Node:
/* Send each worker its share of rows of the first matrix and the whole second matrix. */
MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
MPI_Send(&a[offset][0], rows*NCA, MPI_LONG_DOUBLE, dest, mtype,
MPI_COMM_WORLD);
MPI_Send(&b, NCA*NCB, MPI_LONG_DOUBLE, dest, mtype, MPI_COMM_WORLD);
offset = offset + rows;
}
/* Receive results from worker tasks */
mtype = FROM_WORKER;
for (i=1; i<=numworkers; i++)
{
source = i;
MPI_Recv (&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
MPI_Recv (&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
MPI_Recv (&c[offset][0], rows*NCB, MPI_LONG_DOUBLE, source, mtype, MPI_COMM_WORLD,
&status);
}
Worker Node:
if (taskid> MASTER)
{
mtype = FROM_MASTER;
MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&a, rows*NCA, MPI_LONG_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&b, NCA*NCB, MPI_LONG_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
for (k=0; k<NCB; k++)
for (i=0; i<rows; i++)
{
c[i][k] = 0.0;
for(j=0; j<NCA; j++)
c[i][k] = c[i][k] + a[i][j] * b[j][k];
}
mtype = FROM_WORKER;
MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&c, rows*NCB, MPI_LONG_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}
MPI_Finalize();
The MPI code is split into the Master Node segment and the Worker Node segments.
The Master Node splits and sends the rows of the first matrix to the respective
worker processes, collects the result computed by each worker, and combines them
to form the final resultant matrix. The second matrix is broadcast to all Worker
Nodes.
The Worker Nodes receive their respective rows from the Master Node and carry out
their computations. Once a node has completed its calculation, its result is sent back
to the Master Node. The Master Node finally combines all the results in order and
displays the final output.
The observed execution times for [100 x 100] and [500 x 500] matrices are as follows:
GRAPH 2: EXECUTION TIME OF 100X100 MATRIX IN MPI
It is seen from the graph that the execution time falls steadily in the region of 5 to
20 cores, so this range gives the optimum number of cores for this task. The spike in
the graph when the number of cores exceeds 20 indicates that the overhead of
organizing the cores has exceeded the computational work of the program.
5. CONCLUSION
It was found that running a parallel program on multiple cores was beneficial only
as long as the time required to initialise the processors remained small compared to
the actual execution time of the program.
6. REFERENCES
1. https://computing.llnl.gov/tutorials/mpi
2. https://computing.llnl.gov/tutorials/openMP
3. https://en.wikipedia.org/wiki/Message_Passing_Interface
4. https://en.wikipedia.org/wiki/OpenMP
5. Books: Parallel Programming - Techniques; Parallel Programming with MPI
6. https://igcar.gov.in
7. https://en.wikipedia.org/wiki/Linux