
Job Management Systems

Author: Anand Vaidya
Why use SGE?
● Maintain order in a shared resource – like queuing
up at a movie ticket counter rather than mobbing
the counter
● Apply different usage policies – PhDs and Profs
get better treatment than first year grads
● Everyone gets a fair (!) share of the computing
What is SGE?
● SGE is a distributed resource management system
● Provides users the means to submit
computationally demanding tasks to the
SGE system for transparent distribution of
the associated workload.
What is SGE? Layman Terms
● You have a collection of mostly idle Macs,
Windows, Linux and Solaris machines
● You have plenty of computations or
simulations to run.
● Can we just use these machines to run
those computations?
● Who will manage this herd? SGE will...
SGE Overview

[Diagram. Surviving labels: "Users and their ..."; "Configs"; "SGE"; "Users' jobs run here" (twice)]
How does SGE work?
● Users submit jobs to the Grid Engine.
● Unless resources are immediately
available, non-interactive jobs are kept in
queues until resources to execute them
become available.
● Jobs are passed on to the available
execution hosts.
● Records of each job's progress through the
system are kept and reported when
requested.
[Diagram. Surviving labels: master, execd; job requests; execd; DRMAA client]
Supported OS
● Linux 32 and 64 bit
● Solaris (Sparc and x64)
● Windows (exec only)
SGE Components
● Hosts
➢ Master (coordinate activities, hold queues)
➢ Shadow Master
➢ Execution (workers)
➢ Administration (sets up system, queues etc)
➢ Submit (users can submit jobs from these)
SGE Components
● Usually the master and admin host are the same
● Queues (defined by the administrator)
● User and Administrator Commands
● Daemons:
➢ sge_qmaster (Master Daemon)
➢ sge_schedd (Scheduler Daemon)
➢ sge_execd (Execution Daemon)
➢ sge_commd (Communication Daemon)
4 Job Types
● Interactive jobs - user gets back a shell window
● Batch jobs – just run once and store output for
review later
● Array jobs (aka parametric – e.g. image rendering)
● Parallel (MPI) jobs – Can't describe in one line :-(
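An array job runs one script many times, and SGE hands each run a different task index. A minimal sketch, assuming ten input files named frame_1.dat to frame_10.dat (hypothetical names) and a placeholder echo instead of a real renderer. The -t option and SGE_TASK_ID are standard SGE; everything else is illustrative:

```shell
#!/bin/bash
#$ -cwd
#$ -S /bin/bash
# Array job: SGE starts tasks 1 through 10 of this same script
#$ -t 1-10

# SGE sets SGE_TASK_ID for each task; default to 1 so the script
# can also be tried outside SGE.
task_id=${SGE_TASK_ID:-1}

# Hypothetical naming scheme: one input file per task
input="frame_${task_id}.dat"
echo "task ${task_id} would process ${input}"
```

Each task sees only its own index, so the tasks need no coordination with each other.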
● GUI (qmon)
● Command Line / textual (qsub etc)
● Programmatic (DRMAA)

DRMAA = Distributed Resource Management Application API, where
API = Application Programming Interface.
Can you see the duplication? DRMA should have been sufficient...
What is a job?
● Describes:
➢ What to run (program name)
➢ What environment is needed?
➢ What resources are needed (how many CPUs, how
much RAM etc.)
➢ Email on completion?
➢ Send output of job to another file?
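Each of these questions maps to a line in a job script. A minimal sketch, assuming a job called demo_job, a placeholder email address and a ten-minute run-time limit (all hypothetical values). Lines starting with #$ are read by qsub and are plain comments to the shell:

```shell
#!/bin/bash
# Job name (hypothetical)
#$ -N demo_job
# Environment: export the submitting shell's variables to the job
#$ -V
# Resources: at most 10 minutes of run time
#$ -l h_rt=00:10:00
# Email on completion, sent to a placeholder address
#$ -m e
#$ -M user@example.com
# Output: write stdout to this file and merge stderr into it
#$ -o demo_job.out
#$ -j y

# What to run: uname and date stand in for a real program
result="ran on $(uname -n) at $(date +%H:%M)"
echo "$result"
```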
Queues and Instances
● Queues are logical constructs shared by all hosts
attached to the queue; a queue by itself cannot run jobs
● Queue Instances actually reside on hosts and
“contain” jobs
● Queue config shared by all instances
● Each instance can have unique properties,
different from Queue
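The queue/instance split is visible in the names that qstat prints, which have the form queue@host. A small sketch that separates the two parts, using an instance name taken from the qstat example in this deck:

```shell
#!/bin/bash
# Queue instance name as printed by qstat
instance="all.q@shark-c00"

# Part before the @: the queue (shared configuration)
queue=${instance%@*}
# Part after the @: the host where this instance resides
host=${instance#*@}

echo "queue=$queue host=$host"
```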
● Determine archs you will support and download
appropriate packages.
● Unpack tarballs
● Write auto-install script
● ssh $MASTER ; $SGE_ROOT/inst_sge -m -auto
sge-auto.conf ; /etc/init.d/sgemaster start
● ssh $SHADOW ; $SGE_ROOT/inst_sge -sm -auto
sge-auto.conf; /etc/init.d/sgemaster -shadowd start
● $SGE_ROOT/inst_sge -x -auto sge-auto.conf ;
psh compute /etc/init.d/sgeexecd start
● Check : qhost
● Done!
SGE Commands - qhost
● What is the state of the cluster? How many nodes,
type, load? What is my chance of getting a node?
[root@shark ~]# qhost
HOSTNAME   ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
global     -           -     -     -       -       -       -
shark-c00  lx24-amd64  2     2.02  3.9G    240.8M  4.0G    0.0
shark-c02  lx24-amd64  2     2.00  3.9G    214.9M  4.0G    0.0
shark-c03  lx24-amd64  2     1.76  3.9G    215.9M  4.0G    0.0
SGE Commands - qsub
● Create a job script
● Submit for execution
$ qsub
Your job 742 ("") has been submitted.
Simplest Job:
[vaidya@shark ~]$ cat
sleep 10
date > /tmp/test1.out.txt
Variations: qsub -cwd
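The create-and-submit steps can be sketched end to end. This assumes the script is saved as test1.sh (a hypothetical name) and submits only when qsub is actually on the PATH:

```shell
#!/bin/bash
# Write the two-line job from the slide to a file (hypothetical file name)
cat > test1.sh <<'EOF'
sleep 10
date > /tmp/test1.out.txt
EOF

if command -v qsub >/dev/null 2>&1; then
  # -cwd: run the job in the directory it was submitted from
  qsub -cwd test1.sh
else
  echo "qsub not found; test1.sh written but not submitted"
fi
```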
SGE Commands - qstat
● check status of your job:
qstat ; qstat -f ;
qstat -u username ; qstat -j job_id
[root@shark ~]# qstat
job-ID  prior    name     user   state  submit/start at      queue            slots  ja-task-ID
639     0.55500  HCPDIV7  test1  r      05/17/2006 10:16:31  all.q@shark-c00
658     0.55500  HCPDIV1  test1  r      05/17/2006 13:37:35  all.q@shark-c00
694     0.55500  FCCDVI   test1  r      05/17/2006 23:52:19  all.q@shark-c02
695     0.55500  FCCDVI1  test1  r      05/17/2006 23:52:19  all.q@shark-c02
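A listing like this is easy to summarize with awk. The sketch below counts running jobs per queue instance, using the four rows above as sample input. It assumes the column layout shown here (state in field 5, queue instance in field 8) and looks only at running jobs, since waiting jobs have no queue instance yet:

```shell
#!/bin/bash
# Sample rows copied from the qstat listing; on a live cluster,
# pipe the output of qstat into the same awk program instead.
sample='639 0.55500 HCPDIV7 test1 r 05/17/2006 10:16:31 all.q@shark-c00
658 0.55500 HCPDIV1 test1 r 05/17/2006 13:37:35 all.q@shark-c00
694 0.55500 FCCDVI test1 r 05/17/2006 23:52:19 all.q@shark-c02
695 0.55500 FCCDVI1 test1 r 05/17/2006 23:52:19 all.q@shark-c02'

# Skip rows whose first field is not a job number (header, ruler lines),
# then count jobs in state r per queue instance.
summary=$(printf '%s\n' "$sample" |
  awk '$1 ~ /^[0-9]+$/ && $5 == "r" { n[$8]++ }
       END { for (q in n) print q, n[q] }' | sort)
echo "$summary"
```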
SGE Commands - qstat
● Status of the job is indicated by letters as:
qw  - waiting      t   - transferring
r   - running      s,S - suspended
R   - restarted    T   - threshold
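For scripts that post-process qstat output, the table can be turned into a lookup function. A minimal sketch: describe_state is a hypothetical helper name, and it handles only the codes listed on this slide, one code at a time:

```shell
#!/bin/bash
# Map a qstat state code to the description given on the slide.
describe_state() {
  case "$1" in
    qw)  echo "waiting" ;;
    t)   echo "transferring" ;;
    r)   echo "running" ;;
    s|S) echo "suspended" ;;
    R)   echo "restarted" ;;
    T)   echo "threshold" ;;
    *)   echo "unknown state: $1" ;;
  esac
}

describe_state qw
describe_state S
```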
SGE Commands - qdel
● Delete your job, if you wish
qdel 743
vaidya has deleted job 743
SGE Commands - qmon
● qmon is an X Window GUI tool to
submit/delete/view jobs and configure the SGE system
● Example: Submit a job using qmon
– Click the Job Submission icon.
– Click the Job Script file selection icon to open a file selection
box and select your script file. Then, click OK.
– Click the Submit button at the bottom of the Job Submission dialog.
– After a couple of seconds, you should be able to monitor your
job in the Job Control dialog. Click the Job Control icon in the
QMON control panel.
– You first see it under Pending Jobs, and it quickly moves to
Running Jobs after it gets started.
SGE Commands – qsh, qtcsh
● Before submitting an interactive session request:
➢ Ensure you have a valid X server running on
your desktop.
➢ Allow remote X clients to display on
your desktop.
● Then submit the interactive session request.
Note: using this feature needs additional configuration and may
not work otherwise.
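The checks can be scripted before the request is made. A minimal sketch, assuming qsh (named in the slide title) is the command used for the request; request_interactive is a hypothetical helper name, and the sketch does not cover the additional configuration mentioned in the note:

```shell
#!/bin/bash
# Refuse to request a session unless an X display and qsh are available.
request_interactive() {
  if [ -z "${DISPLAY:-}" ]; then
    echo "DISPLAY is not set; start an X server first" >&2
    return 1
  fi
  if ! command -v qsh >/dev/null 2>&1; then
    echo "qsh not found on PATH" >&2
    return 1
  fi
  qsh
}

request_interactive || echo "interactive session not started"
```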
SGE Commands – jobscript
● sample job script:
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
sleep 10
SGE Commands – jobscript
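● Sample parallel (MPI) job script:
#$ -cwd
#$ -j y
#$ -S /bin/bash
# $NSLOTS and $TMPDIR/machines are set up by the parallel environment requested with -pe
$MPI_DIR/mpirun -np $NSLOTS -machinefile $TMPDIR/machines \
    myparallelprog.exe {infile.txt outfile.txt}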
Jobscript – useful directives
● -cwd = change to current dir before running job
● -j y = merge stderr into stdout
● -r y = job is re-runnable
● -N jname = set the job name
● -l h_rt=00:30:00 = run job for a maximum of 30 minutes
● -pe mpich N = request the mpich parallel environment with N slots
● -pe mpich-ib N = use the InfiniBand parallel environment
● -pe mpich-eth N = use the Ethernet parallel environment
● -V = carry over all environment variable settings
● -M address = send email to this address
● -m bes = send email at job begin, end and suspend
Jobscript – useful directives
● -A acctname_to_charge = account to charge the job to
● -a [[CC]YY]MMDDhhmm[.SS] = when to run
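The same options also work on the qsub command line instead of as #$ lines inside the script. A minimal sketch, assuming a script called myjob.sh, a job name nightly_run and a placeholder email address (all hypothetical). When qsub or the script is missing, it only prints the command it would run:

```shell
#!/bin/bash
job_script="myjob.sh"   # hypothetical script name

# Options taken from the directive list; name and address are placeholders.
set -- -cwd -j y -r y -N nightly_run -l h_rt=00:30:00 -m bes -M user@example.com
submit_cmd="qsub $* $job_script"

if command -v qsub >/dev/null 2>&1 && [ -f "$job_script" ]; then
  qsub "$@" "$job_script"
else
  echo "would run: $submit_cmd"
fi
```

Command-line options are handy for one-off changes, such as a longer h_rt for a single run, without editing the script.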
Admin Commands
Next few slides show commands useful for SGE
admins (not users/researchers)
Admin Commands - qconf
In general,
● qconf -s** to show config
● qconf -m** to modify config
● qconf -M** to modify config from a text file
● qconf -d** to delete config
SGE Commands – qconf
● Show:
– complexes: qconf -sc
– queues: qconf -sql
– PEs: qconf -spl
– exec hosts: qconf -sel (one host: qconf -se c35)
– submit hosts: qconf -ss
– admin hosts: qconf -sh
– calendars: qconf -scall
– configuration: qconf -sconf
– user list: qconf -suserl
– scheduler conf: qconf -ssconf
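Since all of these are read-only, they can be looped over to keep a text snapshot of the cluster configuration. A minimal sketch, assuming an output directory called sge-config-snapshot (a hypothetical name). On a machine without qconf it only lists the commands it would run:

```shell
#!/bin/bash
outdir="sge-config-snapshot"   # hypothetical directory name
mkdir -p "$outdir"

planned=""
for opt in -sc -sql -spl -sel -ss -sh -scall -sconf -suserl -ssconf; do
  planned="$planned qconf $opt;"
  if command -v qconf >/dev/null 2>&1; then
    # One file per view, e.g. qconf-sql.txt
    qconf "$opt" > "$outdir/qconf${opt}.txt"
  fi
done
echo "commands:$planned"
```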
SGE Commands – qping
[anand@shark-c02 ~]$ qping -info shark-c01 537 execd 1
05/24/2006 21:57:34:
SIRM version: 0.1
SIRM message id: 1
start time: 05/24/2006 21:31:37
run time [s]: 1768
messages in read buffer: 0
messages in write buffer: 0
nr. of connected clients: 2
status: 0
info: dispatcher: R (0.04) | OK
Monitor: disabled
Acknowledgements & Copying
● This material is based on my experience as well as material
collected from SGE documentation.

● This presentation can be redistributed as follows:

➢ No commercial re-distribution: eg, as part of a for-profit
CDROM or as part of your sales pitch. Seek my permission
➢ Must attribute the document creator.
➢ Share alike: If you use this document and enhance or
modify it, share the modifications or the modified document
➢ Which means I apply: Creative Commons License,
The End
● Thanks for your time. If you have any feedback, corrections
or questions please contact me: Anand Vaidya,
● This document was created with OpenOffice on Linux. Email me if
you want the odp file instead of the pdf.
