J. Lakshmi
3/16/2009
Agenda
Conceptual introduction to batch computing
Batch schedulers available at SERC
Queue configuration and job-submission information for the different schedulers
General guidelines for using batch schedulers
Questions!
Advanced User Education Programme SERC,IISc.
When to use?
Use batch scheduling for tested programs or codes that need to be run multiple times with different data and that have execution times greater than one hour.
[Diagram: batch-scheduling architecture. A submission client sends a job script to the batch scheduler, which places the job in job queues; the job scheduler then dispatches jobs from the queues to the execution nodes.]
LoadLeveler@SERC
LoadLeveler (LL) is the batch scheduler from IBM. LL manages both serial and parallel jobs over a pool of machines or servers, often referred to as an LL cluster. Jobs are allocated to machines in the cluster by a scheduler, and the allocation depends on the availability of resources within the cluster and on rules defined by the LL administrator. A user submits a job using a job command file, which contains details of the executable, its dependencies, and LL directives.
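A job command file of this kind might look like the following minimal sketch. LL directives are comment lines beginning with "# @"; the class name, file names, and ./a.out here are illustrative assumptions, not a SERC-specific recipe (pick a class from llclass output):

```shell
#!/bin/sh
# Minimal LoadLeveler job command file (a sketch; the class name,
# file names and ./a.out are illustrative).
# @ job_type         = parallel
# @ class            = p5task4
# @ output           = myjob.$(jobid).out
# @ error            = myjob.$(jobid).err
# @ wall_clock_limit = 4:00:00
# @ total_tasks      = 4
# @ queue
./a.out
```

Submit it with "llsubmit myjob.cmd" and monitor it with llq.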
LL@SERC
LL is installed on almost all IBM servers and parallel machines hosted at SERC:
P690 (IBM Regatta) machines
P575 machines
P720 (256-node) IBM Linux cluster
IBM Blue Gene/L
IBM SP3
Useful LL commands
llq - queries information about jobs in the LoadLeveler queues
llcancel <jobid> - cancels one or more jobs from the LoadLeveler queue
llclass - returns information about classes
llsubmit - submits a job to LoadLeveler
llstatus - returns status information about machines in the LoadLeveler cluster
LL@P690&P575
There are four logical P690 machines and two P575 machines controlled by a single LL manager. All machines run the AIX OS. Three of the P690s (regatta1/2/3) accept parallel jobs and one (regatta4) is for interactive use. Both P575 machines accept parallel jobs. The machine regatta4 is the submission host for this cluster. Jobs on this cluster are restricted by wall-clock time. Queue information for this cluster:
Class      Wall_clock_limit   Max Processors
p5task4    4:00:00            4
p5task8    16:00:00           8
p5gtask16  32:00:00           16   (for Gaussian)
p5task16   32:00:00           16
LL@P720 Cluster
P720 is a Linux cluster and accepts only parallel jobs. Jobs are controlled by one LL manager for this cluster. Queue information for this cluster:

Class      Wall_clock_limit   Max Processors   TotTasks
ptask32    02:00:00           32               32
ptask128   1+08:00:00         128              200
ptask64    2+16:00:00         64               --

(A total of 200 MPI tasks are shared between ptask128 and ptask64.)
LL@BlueGene/L
Each node on Blue Gene/L consists of two processors, and LL can allocate these in two different ways:
VN mode - both processors are allocated for computation (beneficial for compute-intensive jobs)
CO mode - one processor is allocated for computation and the other for communication (beneficial for communication-intensive jobs)
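In practice the mode is chosen when the job is launched. A hedged sketch, assuming the Blue Gene/L mpirun front end (the task counts and ./a.out are illustrative):

```shell
# VN mode: both processors of each node compute,
# so 32 nodes can run 64 MPI tasks.
mpirun -mode VN -np 64 ./a.out

# CO mode: one processor computes, the other handles communication,
# so 32 nodes run 32 MPI tasks.
mpirun -mode CO -np 32 ./a.out
```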
LLQueues on BlueGene/L
Queue         Wall_clock_limit   No. of jobs   No. of Nodes   No. of MPI Tasks   Allowed Modes
pnode32       4:00:00            2             32             32 or 64           CO and VN
pnode32-24h   24:00:00           2             32             32 or 64           CO and VN
pnode128      16:00:00           2             128            128 or 256         CO and VN
pnode128-24h  24:00:00           1             128            128 or 256         CO and VN
pnode512      48:00:00           1             512            512 or 1024        CO and VN
pnode1024     120:00:00          4             512            1024               VN only
pnode2048     60:00:00           2             1024           2048               VN only
pnode4096     48:00:00           1             2048           4096               VN only

Small block includes pnode32, pnode32-24h, pnode128, pnode128-24h and pnode512; it has 2 midplanes and supports both CO and VN modes.
Big block includes pnode1024, pnode2048 and pnode4096; it has six midplanes and supports only VN mode.
PBSPro@SERC
PBSPro is the commercial version of OpenPBS/Torque, initially developed at NASA and now sold by Altair. It is a flexible workload manager that can schedule different jobs for different users on a set of distributed, heterogeneous systems, and it can enforce system-, user-, and software-specific controls on jobs. SERC currently runs PBSPro version 8.0.0.
PBSPro@SERC
Available on all Linux-based systems from SUN, HP and SGI. Each PBSPro cluster typically manages a homogeneous set of machines. Four clusters are available at SERC:
altix
altix350-1
altix350-2
hplx
PBSPro@altix
Consists of a single 32-CPU SMP machine with hostname altix. Supports only 16-CPU parallel jobs. Jobs are restricted by per-processor CPU time, number of jobs in execution, and number of jobs per user. Automatic job routing based on job-script parameters. Queue parameters:
Queue     Memory  CPUTime  Walltime  Node  Run  Que  Lm  State
--------  ------  -------  --------  ----  ---  ---  --  -----
qp100       --      --        --      --     2   12   2   E R
route_q     --      --        --      --     0    0  --   E R
                                           ---  ---
                                             2   12
Kill a running job: qdel <job_id>
Detailed job status: tracejob <job_id>
(tracejob works correctly only if executed on the node where the PBS server is running.)
Not all PBS directives described in the user guide may work for a given installation; this depends on the configuration. If you want to use something specific, please check with your system administrator.
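As an illustration of how job-script parameters drive the automatic routing, a minimal PBSPro job script might look like the following sketch (the resource values, file names and ./a.out are illustrative assumptions; check the queue limits for your cluster):

```shell
#!/bin/sh
# Minimal PBSPro job script (a sketch). The routing queue selects
# an execution queue from the resources requested below.
#PBS -N myjob
#PBS -l ncpus=4
#PBS -l cput=24:00:00
#PBS -o myjob.out
#PBS -e myjob.err
# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR
./a.out
```

Submit with "qsub myjob.sh" and check status with qstat.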
PBSPro@altix350-1
Consists of a single 16-CPU SMP machine with hostname altix350-1. Supports serial and 4/8-CPU parallel jobs. Jobs are restricted by per-processor CPU time, total job CPU time, number of jobs in execution, and number of jobs per user. Automatic job routing based on job-script parameters. Queue parameters:
Queue     Memory  CPUTime   Walltime  Node  Run  Que  Lm  State
--------  ------  --------  --------  ----  ---  ---  --  -----
route_q     --       --        --      --     0    0  --   E R
qp_4_32     --   128:00:0      --      --     2   32   2   E R
qp_4_64     --   256:00:0      --      --     1   23   2   E R
qp_8_32     --   256:00:0      --      --     0    0   1   E R
qs_32       --    32:00:0      --      --     3    7   4   E R
                                            ---  ---
                                              6   62
PBSPro@altix350-2
Consists of a single 16-CPU SMP machine with hostname altix350-2. Supports 8-CPU parallel jobs. Jobs are restricted by per-processor CPU time, total job CPU time, number of jobs in execution, and number of jobs per user. Automatic job routing based on job-script parameters. Queue parameters:
Queue     Memory  CPUTime   Walltime  Node  Run  Que  Lm  State
--------  ------  --------  --------  ----  ---  ---  --  -----
route_q     --       --        --      --     0    0  --   E R
qp_8_64     --   512:00:0      --      --     0    1   2   E R
qp_8_100    --   800:00:0      --      --     2   34   2   E R
                                            ---  ---
                                              2   35
PBSPro@hplx&sunlx
Consists of 18 nodes: 10 hplx and 8 sunlx machines, all loaded with 64-bit Linux. The server and scheduler for this cluster is hplx1_2. Currently undergoing reconfiguration; for details contact Mr. Chandrappa <chandru@serc.iisc.ernet.in>. Supports only serial jobs. Jobs are restricted by per-processor CPU time, total job CPU time, number of jobs in execution, and number of jobs per user. Automatic job routing based on job-script parameters. Queue parameters:

Queue   Memory  CPUTime   Walltime  Node  Run  Que  Lm  State
------  ------  --------  --------  ----  ---  ---  --  -----
qh64      --    64:00:00     --      --     2    0  24   D R
qh16      --    16:00:00     --      --     0    0  24   D R
route     --       --        --      --     0    0  --   E R
qh8       --    08:00:00     --      --     0    0  24   E R
qh256     --    256:00:0     --      --     2    0  24   D R
qh32      --    32:00:00     --      --     0    0  24   D R
                                          ---  ---
                                            4    0

Queue-specific details can be found by executing the command:
qmgr
qmgr> list queue qh64
LSF@SERC
Load Sharing Facility (LSF) is a commercial job scheduler sold by Platform Computing. It can execute batch jobs on networked Unix and Windows systems across many different architectures. LSF version 4.1 is currently installed on the Compaq ES40 machines (commonly known as the alpha servers). In LSF there is no concept of a job script: you create a shell script that contains details of your executable and its dependencies and submit it as a job to LSF, or use the various job-submission options to specify the executable's dependencies.
LSF@alphas4
The alpha server cluster consists of 5 ES40 servers, each with 4 CPUs. The cluster allows only serial jobs and has alphas4 as the submission host. All other machines are execution nodes. Queue configuration for the cluster:

QNAME      PRIO  STATUS       MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
8hr         10   Open:Active   4    1     -     -      0     0     0    0
64hr         6   Open:Active   4    1     -     -      0     0     0    0
32hr         6   Open:Active   4    1     -     -      0     0     0    0
16hr         6   Open:Active   4    1     -     -      0     0     0    0
128hr        4   Open:Active   4    1     -     -      0     0     0    0
256hr        2   Open:Active   4    1     -     -      0     0     0    0
g98_q        1   Open:Active   1    1     -     -      0     0     0    0
unlimited    1   Open:Active   4    1     -     -      0     0     0    0

Queue-specific details can be found by executing the command
bqueues -l <queue_name>
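Since LSF takes the executable (or a wrapper shell script) directly on the command line, a submission to one of the queues above might look like this sketch (the file names and ./a.out are illustrative):

```shell
# Submit a serial job to the 8hr queue, writing stdout to myjob.out.
bsub -q 8hr -o myjob.out ./a.out

# Or submit a wrapper shell script that sets up dependencies first:
bsub -q 16hr -o myjob.out sh myjob.sh
```

Monitor jobs with bjobs and cancel them with bkill <jobid>.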
When in doubt go to
http://www.serc.iisc.ernet.in/ComputingFacilities/software/software.htm
Thank you!
ANY QUESTIONS?