
Job Management Systems

SGE
v1.4
Author: Anand Vaidya
anand@vsa-services.com
Why use SGE?
● Maintain order in a shared resource – like queuing
up at a movie ticket counter rather than mobbing
the counter
● Apply different usage policies – PhDs and Profs
get better treatment than first year grads
● Everyone gets a fair (!) share of the computing
resource.
What is SGE?
● SGE is distributed resource management
software
● It gives users the means to submit
computationally demanding tasks to the
SGE system for transparent distribution of
the associated workload.
What is SGE? Layman Terms
● You have a collection of mostly idle Macs,
Windows, Linux and Solaris machines
● You have plenty of computations or
simulations to run.
● Can we just use these machines to run
those computations?
● Who will manage this herd? SGE will...
SGE Overview

[Figure: users and their desktops/laptops submit to SGE, which holds the configs and rules; the users' jobs run on the execution hosts]
How does SGE work?
● Users submit jobs to the Grid Engine.
● Unless resources are immediately
available, non-interactive jobs are kept in
queues until the resources to execute them
become available.
● Jobs are passed on to the available
execution hosts
● Records of each job's progress through the
system are kept and reported when
requested.
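The submit, queue, run and report cycle above can be followed from a script. A minimal sketch: `wait_for_job` and its polling interval are illustrative names, not part of SGE, and it assumes that `qstat -j <id>` exits with a non-zero status once the job has left the system (worth checking on your installation).

```shell
#!/bin/sh
# wait_for_job JOB_ID [INTERVAL_SECONDS]
# Hypothetical helper: polls qstat until the job is gone.
# Assumption: "qstat -j ID" returns non-zero when the job has
# finished or is unknown to the qmaster.
wait_for_job() {
    job_id=$1
    interval=${2:-10}
    while qstat -j "$job_id" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "job $job_id is no longer queued or running"
}
```

Typical use would be `wait_for_job 742` right after a `qsub`; the job's own output file still has to be read separately.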
[Figure: job requests go to the SGE master (backed by shadow masters), which passes them to the execd hosts; results and errors come back. A DRMAA client (applications) is also shown.]
Supported OS
● Linux 32 and 64 bit
● Solaris (Sparc and x64)
● Windows (exec only)
● Mac OS X
● AIX
● HP-UX, IRIX, etc.
SGE Components
● Hosts
➢ Master (coordinate activities, hold queues)
➢ Shadow Master
➢ Execution (workers)
➢ Administration (sets up system, queues etc)
➢ Submit (users can submit jobs from these)
SGE Components
● Usually the master and admin host are the same
machine
● Queues (defined by the administrator)
● User and Administrator Commands
● Daemons:
➢ sge_qmaster (Master Daemon)
➢ sge_schedd (Scheduler Daemon)
➢ sge_execd (Execution Daemon)
➢ sge_commd (Communication Daemon)
4 Job Types
● Interactive jobs - user gets back a shell window
● Batch jobs – run once and store the output for
review later
● Array jobs (aka parametric, e.g. image rendering)
● Parallel (MPI) jobs – Can't describe in one line :-(
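An array job is easiest to picture as one script run many times with a different task number. A minimal sketch, assuming the `-t` range and the `SGE_TASK_ID` variable that SGE sets per task; the `frame_NNNN.dat` naming and `input_for_task` are made up for illustration.

```shell
#!/bin/sh
#$ -cwd
#$ -t 1-10
# SGE sets SGE_TASK_ID to 1..10, one value per task of the array job.
# Outside SGE the variable is unset, so default to 1 for a dry run.
task_id=${SGE_TASK_ID:-1}

# Hypothetical naming scheme: frame_0001.dat, frame_0002.dat, ...
input_for_task() {
    printf 'frame_%04d.dat\n' "$1"
}

echo "task $task_id would render $(input_for_task "$task_id")"
```

Each task picks its own input, so ten frames can render on ten slots without the script knowing anything about the other tasks.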
Accessing...
● GUI (qmon)
● Command Line / textual (qsub etc)
● Programmatic (DRMAA)

DRMAA = Distributed Resource Management Application API, where
API = Application Programming Interface.
Can you see the duplication? DRMA should have been sufficient...
What is a job?
● A job describes:
➢ What to run (program name)
➢ What environment is needed
➢ What resources are needed (how many CPUs, how
much RAM, etc.)
➢ Whether to send email on completion
➢ Whether to send the job's output to another file
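Most of the points above map onto `#$` directive lines at the top of a job script. A sketch only: the job name, email address and output file name are placeholders, and CPU and memory requests are left out because resource names vary between sites.

```shell
#!/bin/sh
# What to run: this script; -N gives the job a name
#$ -N demo_job
# Environment: start in the submit directory, copy current env vars
#$ -cwd
#$ -V
# Resources: at most 30 minutes of run time
#$ -l h_rt=00:30:00
# Email on completion (placeholder address)
#$ -m e
#$ -M you@example.org
# Output: merge stderr into stdout and write it to a named file
#$ -j y
#$ -o demo_job.out

describe_host() {
    echo "running on $(uname -n) at $(date)"
}
describe_host
```

Because the directives are ordinary shell comments, the same file also runs unchanged as a plain script, which is handy for testing before submitting.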
Queues and Instances
● Queues are logical constructs shared by all hosts
attached to them; a queue by itself cannot run jobs
● Queue instances actually reside on hosts and
“contain” jobs
● The queue config is shared by all its instances
● Each instance can have unique properties that
differ from the queue's
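The queue / queue instance split shows up in names such as `all.q@shark-c00`: queue on the left of the `@`, host on the right. A small sketch for taking such a name apart; `queue_of` and `host_of` are illustrative helper names.

```shell
#!/bin/sh
# A queue instance is written as queue@host, e.g. all.q@shark-c00.
queue_of() { echo "${1%%@*}"; }   # part before the first '@'
host_of()  { echo "${1#*@}"; }    # part after the first '@'

inst="all.q@shark-c00"
echo "queue=$(queue_of "$inst") host=$(host_of "$inst")"
```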
Installing...
● Determine the architectures you will support and download
the appropriate packages.
● Unpack the tarballs
● Write the auto-install config (sge-auto.conf)
● Master: ssh $MASTER ; $SGE_ROOT/inst_sge -m -auto sge-auto.conf ; /etc/init.d/sgemaster start
● Shadow master: ssh $SHADOW ; $SGE_ROOT/inst_sge -sm -auto sge-auto.conf ; /etc/init.d/sgemaster -shadowd start
● Execution hosts: $SGE_ROOT/inst_sge -x -auto sge-auto.conf ; psh compute /etc/init.d/sgeexecd start
● Check : qhost
● Done!
SGE Commands - qhost
● What is the state of the cluster? How many nodes,
type, load? What is my chance of getting a node?
[root@shark ~]# qhost
HOSTNAME   ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global     -           -     -     -       -       -       -
shark-c00  lx24-amd64  2     2.02  3.9G    240.8M  4.0G    0.0
shark-c02  lx24-amd64  2     2.00  3.9G    214.9M  4.0G    0.0
shark-c03  lx24-amd64  2     1.76  3.9G    215.9M  4.0G    0.0
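To answer "which node is least busy?" without reading the table by eye, the qhost output can be filtered. A rough sketch, assuming the column layout shown above (LOAD in the fourth column, two header lines); `least_loaded` is an illustrative name.

```shell
#!/bin/sh
# least_loaded: read qhost output on stdin, print the host with the
# lowest LOAD value. Skips the header, the separator and "global".
least_loaded() {
    awk 'NR > 2 && $1 != "global" && $4 != "-" {
             if (best == "" || $4 + 0 < min) { min = $4 + 0; best = $1 }
         }
         END { if (best != "") print best }'
}

# Sample taken from the slide; on a live cluster: qhost | least_loaded
least_loaded <<'EOF'
HOSTNAME   ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------
global     -           -     -     -       -       -       -
shark-c00  lx24-amd64  2     2.02  3.9G    240.8M  4.0G    0.0
shark-c02  lx24-amd64  2     2.00  3.9G    214.9M  4.0G    0.0
shark-c03  lx24-amd64  2     1.76  3.9G    215.9M  4.0G    0.0
EOF
```

The scheduler makes the real placement decision; this only helps a human get a feel for the cluster.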
SGE Commands - qsub
● Create a job script (myjob.sh)
● Submit for execution
$ qsub myjob.sh
Your job 742 ("myjob.sh") has been submitted.
Simplest Job:
[vaidya@shark ~]$ cat myjob.sh
#!/bin/sh
sleep 10
date > /tmp/test1.out.txt
Variations: qsub -cwd myjob.sh (run the job from the current working directory)
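Scripts that submit jobs usually need the job number back. A sketch that pulls it out of the confirmation line shown above; `job_id_from` is an illustrative helper, and it only handles the plain "Your job N ..." message, not array jobs.

```shell
#!/bin/sh
# job_id_from: extract the numeric ID from qsub's confirmation line,
# e.g. 'Your job 742 ("myjob.sh") has been submitted.'
job_id_from() {
    echo "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

msg='Your job 742 ("myjob.sh") has been submitted.'
echo "submitted job $(job_id_from "$msg")"
# On a live cluster: id=$(job_id_from "$(qsub myjob.sh)")
```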
SGE Commands - qstat
● Check the status of your job:
qstat ; qstat -f ;
qstat -u username ; qstat -j job_id
[root@shark ~]# qstat
job-ID  prior    name     user   state  submit/start at      queue            slots  ja-task-ID
-----------------------------------------------------------------------------------------------
639     0.55500  HCPDIV7  test1  r      05/17/2006 10:16:31  all.q@shark-c00  1
658     0.55500  HCPDIV1  test1  r      05/17/2006 13:37:35  all.q@shark-c00  1
694     0.55500  FCCDVI   test1  r      05/17/2006 23:52:19  all.q@shark-c02  1
695     0.55500  FCCDVI1  test1  r      05/17/2006 23:52:19  all.q@shark-c02  1
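With more than a handful of jobs, a summary is easier to read than the raw listing. A sketch that counts running jobs per queue instance, assuming the plain qstat layout shown on this slide (state in column 5, queue in column 8, two header lines); `running_per_instance` is an illustrative name.

```shell
#!/bin/sh
# running_per_instance: read plain qstat output on stdin and count
# running jobs (state "r") per queue instance (column 8).
running_per_instance() {
    awk 'NR > 2 && $5 == "r" { n[$8]++ }
         END { for (q in n) print q, n[q] }' | sort
}

# Sample taken from the slide; on a live cluster: qstat | running_per_instance
running_per_instance <<'EOF'
job-ID  prior    name     user   state  submit/start at      queue            slots
-----------------------------------------------------------------------------------
639     0.55500  HCPDIV7  test1  r      05/17/2006 10:16:31  all.q@shark-c00  1
658     0.55500  HCPDIV1  test1  r      05/17/2006 13:37:35  all.q@shark-c00  1
694     0.55500  FCCDVI   test1  r      05/17/2006 23:52:19  all.q@shark-c02  1
695     0.55500  FCCDVI1  test1  r      05/17/2006 23:52:19  all.q@shark-c02  1
EOF
```

Pending jobs have no queue column yet, which is why the filter looks at the state first.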
SGE Commands - qstat
● Status of the job is indicated by letters as:
qw  - waiting      t   - transferring
r   - running      s,S - suspended
R   - restarted    T   - threshold
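The same table as a lookup function, for use in reports or monitoring scripts. A sketch covering only the single codes listed on this slide; `describe_state` is an illustrative name, and combined codes that qstat may print are reported as unknown.

```shell
#!/bin/sh
# describe_state CODE: translate a qstat state code into words.
describe_state() {
    case $1 in
        qw)  echo "waiting" ;;
        t)   echo "transferring" ;;
        r)   echo "running" ;;
        s|S) echo "suspended" ;;
        R)   echo "restarted" ;;
        T)   echo "threshold" ;;
        *)   echo "unknown ($1)" ;;
    esac
}

describe_state qw
```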
SGE Commands - qdel
● Delete your job, if you wish
qdel 743
vaidya has deleted job 743
SGE Commands - qmon
● qmon is an X Window System GUI tool to
submit/delete/view jobs and configure the SGE system
● Example: Submit a job using qmon
– Click the Job Submission icon.
– Click the Job Script file selection icon to open a file selection
box and select your script file. Then, click OK.
– Click the Submit button at the bottom of the Job Submission
dialog.
– After a couple of seconds, you should be able to monitor your
job in the Job Control dialog. Click the Job Control icon in the
QMON control panel.
– You first see it under Pending Jobs, and it quickly moves to
Running Jobs after it gets started.
SGE Commands – qsh, qtcsh
● Submit an interactive session request:
qlogin
qrsh
● Ensure you have a working X server running on
your desktop, and allow remote X clients to display on
your desktop.
● Submit an interactive session request that uses your display:
qsh
qtcsh
Note: this feature needs additional configuration and may
not work without it.
SGE Commands – jobscript
● sample job script:
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
date
sleep 10
env
date
SGE Commands – jobscript
● sample job script:
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
$MPI_DIR/mpirun -np $NSLOTS -machinefile $TMPDIR/machines \
    myparallelprog.exe {infile.txt outfile.txt}
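Where does `$TMPDIR/machines` come from? Normally the parallel environment's start script builds it from the host list SGE hands to the job. A sketch of that step, assuming the `$PE_HOSTFILE` layout of one line per host with the host name first and the slot count second; `expand_hostfile` is an illustrative name and not the script your site actually uses.

```shell
#!/bin/sh
# expand_hostfile: read lines of "host slots [queue] [processors]" on
# stdin and print each host once per slot, i.e. the one-host-per-line
# format that mpirun -machinefile expects.
expand_hostfile() {
    awk '{ for (i = 0; i < $2; i++) print $1 }'
}

# Made-up sample; inside a job: expand_hostfile < "$PE_HOSTFILE"
printf 'shark-c00 2 all.q@shark-c00 UNDEFINED\nshark-c02 1 all.q@shark-c02 UNDEFINED\n' |
    expand_hostfile
```

The number of lines produced equals `$NSLOTS`, which is why the two are passed to mpirun together.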
Jobscript – useful directives
● -cwd = run the job from the current working directory
● -j y = merge stderr into stdout
● -r y = job is re-runnable
● -N jname = set the job name
● -l h_rt=00:30:00 = run the job for at most 30 minutes
● -pe mpich N = invoke the mpich parallel environment with N slots
● -pe mpich-ib N = use the InfiniBand parallel environment
● -pe mpich-eth N = use the Ethernet parallel environment
● -V = export all current environment variables to the job
● -M you@uni.edu.sg = address to send email to
● -m bes = send email at job begin, end and suspend
Jobscript – useful directives
● -A acctname = account to charge the job to
● -a [[CC]YY]MMDDhhmm[.SS] = earliest time at which the job may run
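The `-a` time format is easy to get wrong, so a quick shape check before submitting can save a rejected job. A sketch that checks only the digit pattern [[CC]YY]MMDDhhmm[.SS], not whether the month or hour is in range; `is_valid_start_time` is an illustrative name.

```shell
#!/bin/sh
# is_valid_start_time STRING: succeed if STRING has the shape
# [[CC]YY]MMDDhhmm[.SS] (8, 10 or 12 digits, optional .SS).
is_valid_start_time() {
    echo "$1" | grep -Eq '^([0-9]{2}){0,2}[0-9]{8}(\.[0-9]{2})?$'
}

# Build a CCYYMMDDhhmm stamp for "now" and show the command
# that would be used (printed only, nothing is submitted).
now=$(date '+%Y%m%d%H%M')
is_valid_start_time "$now" && echo "qsub -a $now myjob.sh"
```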
Admin Commands
Next few slides show commands useful for SGE
admins (not users/researchers)
Admin Commands - qconf
In general,
● qconf -s** to show config
● qconf -m** to modify config
● qconf -M** to import config from text file
● qconf -d** to delete config
SGE Commands – qconf
● Show:
– complexes: qconf -sc
– queues: qconf -sql
– PE: qconf -spl
– exec hosts: qconf -sel (list), qconf -se c35 (show host c35)
– submit hosts: qconf -ss
– admin hosts: qconf -sh
– calendars: qconf -scall
– configuration: qconf -sconf
– user list: qconf -suserl
– Scheduler conf: qconf -ssconf
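Before changing anything with `qconf -m`, it is worth saving what is there. A sketch that runs the show options from this slide and stores each result in its own file; `snapshot_config` and the `QCONF` override are illustrative (the override allows a dry run on a machine without SGE), and options that need an argument, such as `-se c35`, are left out.

```shell
#!/bin/sh
# snapshot_config DIR: save the output of several "qconf -s..." calls,
# one file per option. Set QCONF=echo for a dry run without SGE.
snapshot_config() {
    dir=$1
    mkdir -p "$dir" || return 1
    for opt in sc sql spl sel ss sh scall sconf suserl ssconf; do
        ${QCONF:-qconf} "-$opt" > "$dir/$opt.txt" 2>&1
    done
}
```

For example, `snapshot_config /tmp/sge-before` followed later by a `diff -r` against a second snapshot shows exactly what an admin session changed.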
SGE Commands – qping
[anand@shark-c02 ~]$ qping -info shark-c01 537 execd 1
05/24/2006 21:57:34:
SIRM version: 0.1
SIRM message id: 1
start time: 05/24/2006 21:31:37 (1148477497)
run time [s]: 1768
messages in read buffer: 0
messages in write buffer: 0
nr. of connected clients: 2
status: 0
info: dispatcher: R (0.04) | OK
Monitor: disabled
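For monitoring, the one line of interest is usually `status:`. A sketch that extracts that field from output shaped like the above; `qping_status` is an illustrative name, and what each status value means should be taken from the qping documentation rather than from this example.

```shell
#!/bin/sh
# qping_status: read "qping -info" output on stdin and print the
# value of the "status:" line.
qping_status() {
    awk -F': *' '{ key = $1; sub(/^[ \t]+/, "", key) }
                 key == "status" { print $2; exit }'
}

# Sample lines from the slide; on a live cluster:
#   qping -info shark-c01 537 execd 1 | qping_status
printf 'nr. of connected clients: 2\nstatus: 0\ninfo: dispatcher: R (0.04) | OK\n' |
    qping_status
```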
Acknowledgements & Copying
● This material is based on my experience as well as material
collected from SGE documentation.

● This presentation can be redistributed as follows:


➢ No commercial re-distribution: eg, as part of a for-profit
CDROM or as part of your sales pitch. Seek my permission
first.
➢ Must attribute the document creator.
➢ Share alike: if you use this document and enhance or
modify it, share the modifications or the modified document
➢ Which means I apply: Creative Commons License,
http://creativecommons.org/licenses/by-nc-sa/2.5/
The End
● Thanks for your time. If you have any feedback, corrections
or questions please contact me: Anand Vaidya,
anand@vsa-services.com
● This document was created with OpenOffice on Linux. Email me if
you want the ODP file instead of the PDF.
