
Sun_Gridengine

Contents
1 Howto
  1.1 Set up the gridengine environment
  1.2 Check if gridengine is running
  1.3 Start gridengine
  1.4 Stop gridengine
  1.5 Check job status for a user
  1.6 List jobs for all users
  1.7 Delete all jobs of a user
  1.8 Delete a job
  1.9 Find out why a job isn't running
  1.10 Clear error state of a job in an error state
  1.11 Submit a job
  1.12 Understand the accounting file
  1.13 Add a server as a submit host
  1.14 Add a server as an execution host
  1.15 Move head node to another server
  1.16 Add a user to Sun Gridengine (with example)
2 CLI Commands
  2.1 Job management
  2.2 User management
  2.3 User access list management
  2.4 Host management
  2.5 Host group management
  2.6 Queue management
  2.7 Queue complexes
  2.8 Project management
  2.9 Configuration
    2.9.1 Automation
  2.10 Other
  2.11 Installation
  2.12 export
  2.13 import

Howto
Set up the gridengine environment
You won't be able to run any gridengine commands until you've set up your gridengine environment.
Log in to a gridengine client (stpuxa01, or one of the gridengine nodes)
Source the gridengine environment

Production
  (bourne) . /sge/prod/default/common/settings.sh
  (csh)    source /sge/prod/default/common/settings.csh
  (bourne) . /usr/sge/default/common/settings.sh
  (csh)    source /usr/sge/default/common/settings.csh

Test
  (bourne) . /sge/test/default/common/settings.sh
  (csh)    source /sge/test/default/common/settings.csh

For old test farm, log into one of the old farm nodes, wallace, or king01/king02
  (bourne) . /usr/sge/default/common/settings.sh
  (csh)    source /usr/sge/default/common/settings.csh
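The choice of settings file can be wrapped in a small helper. The following is only a sketch for the new prod/test farms: the function names sge_settings_path and load_sge_env are hypothetical, and the paths are the ones listed above.

```shell
# Print the settings file path for a farm ("prod" or "test") and a shell
# family ("sh" for bourne/bash, "csh" for csh/tcsh). Hypothetical helper.
sge_settings_path() {
    farm=$1
    shell_family=${2:-sh}
    case $farm in
        prod|test) ;;
        *) echo "unknown farm: $farm" >&2; return 1 ;;
    esac
    case $shell_family in
        sh|csh) ;;
        *) echo "unknown shell family: $shell_family" >&2; return 1 ;;
    esac
    printf '/sge/%s/default/common/settings.%s\n' "$farm" "$shell_family"
}

# Source (do not execute) the bourne settings file if it is readable here.
load_sge_env() {
    settings=$(sge_settings_path "$1" sh) || return 1
    if [ -r "$settings" ]; then
        . "$settings"
    else
        echo "settings file not found: $settings" >&2
        return 1
    fi
}
```

On a gridengine client, "load_sge_env prod" would then stand in for typing the full path. It has to be run in the current shell, because the settings only apply to the shell that sources them.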

Check if gridengine is running


Run qhost, which displays all nodes. If you receive a host list, then the main system is running. Hosts that gridengine cannot communicate with show "-" in their statistics fields.
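To pick out the hosts that gridengine cannot reach, the qhost output can be filtered. This is a sketch under the assumption that qhost prints two header lines, a "global" line, and the load in the fourth column; the function name unreachable_hosts is hypothetical.

```shell
# Read qhost output on stdin and print the hosts whose LOAD column is "-".
# Skips the two header lines and the "global" pseudo-host.
unreachable_hosts() {
    awk 'NR > 2 && $1 != "global" && $4 == "-" { print $1 }'
}

# Intended use on a configured client:
#   qhost | unreachable_hosts
```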

Start gridengine

On qmaster and qmaster shadow host (failover)
  /etc/init.d/sgemaster start
  /etc/init.d/sgeexecd start
  * execd will fail to start if the head node is not configured to run jobs, normal for prod, tes interactive jobs on head nodes
On execution hosts
  /etc/init.d/sgeexecd start

Stop gridengine

On qmaster and qmaster shadow host (failover)
  /etc/init.d/sgemaster stop
  /etc/init.d/sgeexecd stop
  * execd will fail to start if the head node is not configured to run jobs, normal for prod, tes interactive jobs on head nodes
On execution hosts
  /etc/init.d/sgeexecd stop

Check job status for a user


qstat -u <userid>

List jobs for all users


qstat -u \*


Delete all jobs of a user


qdel -u <userid>

Delete a job
qdel <job id#>

Find out why a job isn't running


qstat -j <jobid>

Clear error state of a job in an error state

Submit a job

Understand the accounting file

Add a server as a submit host
You usually need to set up hosts for both test and prod farms. Set up the gridengine environment for each farm from an administrative host. Use stpuxa01 or any of the farm nodes (prod cannot work against test and vice versa).

Production
  (bourne): . /sge/prod/default/common/settings.sh
  (csh)   : source /sge/prod/default/common/settings.csh
Test
  (bourne): . /sge/test/default/common/settings.sh
  (csh)   : source /sge/test/default/common/settings.csh

OLD prod / test farms
You must log in to either sg-001 (prod) or sgt-001 (test) to make the changes
  (bourne): . /usr/sge/default/common/settings.sh
  (csh)   : source /usr/sge/default/common/settings.csh

Add the new server as a submit host


qconf -as (servername)

Verify the server shows up in the submit host list


qconf -ss
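The verification step can also be scripted instead of read by eye. A minimal sketch, assuming qconf -ss prints one host name per line and that the name is spelled the same way it was added; the function name host_in_list and the host name in the usage comment are hypothetical.

```shell
# Succeed if the host name given as $1 appears as a whole line on stdin.
host_in_list() {
    grep -Fxq -- "$1"
}

# Intended use from an administrative host:
#   qconf -ss | host_in_list newserver.example.com && echo "is a submit host"
```

Matching whole lines (-x) avoids a false hit when one host name is a prefix of another.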


Add a server as an execution host

Move head node to another server

Add a user to Sun Gridengine (with example)
Adding a user to Sun Gridengine involves the following steps, performed on each farm in question. A farm access request is normally carried out on both production and test, unless specified otherwise.

1. Systems access needs to add the user's account to the sge-users unix group
2. The user's department needs to be identified to decide which project to assign them to (queues are set up by department)
3. The user needs to be added to the access list for their department.
4. The user needs to have a gridengine registered account created which specifies their default project when submitting a job.
5. A test job needs to be submitted as the user to verify that the job runs through the farm successfully.

Administration work needs to be done from a designated admin host; stpuxa01 is an admin host. The example below outlines the process for adding a user account for gridengine.

Do the required paperwork

1. Create a workorder for the systems access team to add the user's account to the unix sge-admin group via magic. You cannot proceed until the account has been created.
2. Verify that the account exists on a gridengine execution node
   rsh into any node (e.g. sttge001)
   su - (userid)

3. Determine the user's organization to assign queue access. Currently queues consist of the following:
   aa (AnimalAG queue)
   bli (Bioinformatics)
   tcc (TPS)
   production-1 (For high-priority interactive web-applications)

Set up your gridengine environment to allow administration

4. Log in to stpuxa01 as your own user account
5. Source (do not execute) the proper gridengine setup file for the farm you're working on:
   For production (new)
     (bash) . /sge/prod/default/common/settings.sh
     (csh)  source /sge/prod/default/common/settings.csh
   For test (new)
     (bash) . /sge/test/default/common/settings.sh
     (csh)  source /sge/test/default/common/settings.csh

6. Verify that your environment is set up with the qhost command before you continue.
   This will produce a list of gridengine execution hosts similar to the one below
   $ qhost
   HOSTNAME   ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
   -------------------------------------------------------------------
   global
   stpge001   lx24-amd64      4  0.00   15.7G  318.6M   32.0G  120.0K
   stpge002   lx24-amd64      4  0.03   15.7G  326.8M   32.0G  120.0K
   stpge003   lx24-amd64      4  0.00   15.7G  307.2M   32.0G  120.0K
   stpge004   lx24-amd64      4  0.02   15.7G  306.2M   32.0G  120.0K
   (etc)

ADD THE ACCESS

1. Obtain a list of ACLs to determine the associated ACL name for the department
   $ qconf -sul
   aausers
   bliusers
   deadlineusers
   defaultdepartment
   general
   production-1-users
   sysadmins
   tccusers

2. Add the user in question to their appropriate access list. The qconf -mu (modify-user) command is used, which opens a vi editor window with a list of users. The example below is for the bliusers ACL.
   qconf -mu bliusers
   name     bliusers
   type     ACL
   fshare   0
   oticket  0
   entries  aabouk,bbqiu,bgmart,bip,block1,block2,bwang1,dkkova,genquest,gsa, \
            jetaba,jjliu2,jruan,llguo,lllutf,mmlu,pmi,ppli,psa,slstri,ssarif, \
            ssyang3,tvvenk,wwwu1,ychen3,yycao,yzhang2,amwoll,yykong,xxfu

   Add the user ID to the end of the list, then save and exit the file.
3. Verify the user has been added to the access list using the qconf -su (show-user) command.
   qconf -su bliusers

4. Create the gridengine registered account for the user. READ NOTE BELOW.

** Note: If a user has tried to use a grid without access, a temporary registered account is created automatically. The user's account shows up under the registered users list, but will not have a default project assigned and will have a non-zero value in the delete_time field (the date when this account will be automatically deleted).

5. Check if the user has a previously existing registered account


qconf -suserl | grep (userid)

6. Create the registered user account / Modify the existing if present


   If the account exists: qconf -muser (userid)
   If not:                qconf -auser (userid)

   You'll be presented with either a blank template (below) or a template with the user's pre-existing information.
   name template
   oticket 0
   fshare 0
   delete_time 0
   default_project NONE

   The first field, name, needs to contain the userid (instead of the word "template").
   The delete_time field needs to be 0 (an automatically created account will have a date here).
   The default_project needs to be the user's default project (obtained with qconf -sprjl).

   Save and exit
7. Verify the user account was created with the qconf -suser (show-user) command.
qconf -suser (userid)
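The registered-account steps above can also be handled without the editor, since qconf -Auser (listed under Automation) adds a registered user from a file. The following is a sketch only: the function names, the userid jdoe, and the project name bli are hypothetical, and the field layout is the template shown above.

```shell
# Print a registered-user definition in the template layout shown above.
# $1 = userid, $2 = default project (take the real name from qconf -sprjl).
make_user_file() {
    userid=$1
    project=$2
    printf 'name %s\noticket 0\nfshare 0\ndelete_time 0\ndefault_project %s\n' \
        "$userid" "$project"
}

# Read "qconf -suser <userid>" output on stdin; succeed if delete_time is
# non-zero, which is how an automatically created account shows up.
is_temporary_account() {
    awk '$1 == "delete_time" && $2 != "0" { found = 1 } END { exit !found }'
}

# Intended use from an admin host:
#   qconf -suser jdoe | is_temporary_account && echo "temporary account"
#   make_user_file jdoe bli > /tmp/jdoe.user && qconf -Auser /tmp/jdoe.user
```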

Test the account

You'll need to log into a gridengine client, "su -" to the user's account, then submit a job and verify that it successfully runs through the farm.

1. Log in to a gridengine client that the user has access to
   ** If you're unsure which clients are submit hosts, you can use the qconf -ss command from stpuxa01 to show a list of all registered submit hosts.

2. "su -" to the user's account, then set up the gridengine environment
   For production (new)
     (bash) . /sge/prod/default/common/settings.sh
     (csh)  source /sge/prod/default/common/settings.csh
   For test (new)
     (bash) . /sge/test/default/common/settings.sh
     (csh)  source /sge/test/default/common/settings.csh

3. Verify that your environment is set up with the qhost command before you continue.
   This will produce a list of gridengine execution hosts similar to the one below
   $ qhost
   HOSTNAME   ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
   -------------------------------------------------------------------
   global
   stpge001   lx24-amd64      4  0.00   15.7G  318.6M   32.0G  120.0K
   stpge002   lx24-amd64      4  0.03   15.7G  326.8M   32.0G  120.0K
   stpge003   lx24-amd64      4  0.00   15.7G  307.2M   32.0G  120.0K
   stpge004   lx24-amd64      4  0.02   15.7G  306.2M   32.0G  120.0K
   (etc)

4. Submit a job as the user to ensure it runs successfully


qsub -b y -o /dev/null -e /dev/null uname -a

** Note: You can also redirect stdout (-o) and stderr (-e) from the job to a file in the user's home directory to verify that the job ran, which is sometimes easier than monitoring the queue directly.

5. Use qstat to monitor the job. It will start in queue-wait (qw), then enter transition (t), and then run (r). This may happen very quickly if the farm is not loaded. If the organization's queues are full, the job may have to wait in the queue before it runs.
qstat

If the job runs through successfully on the farm, then this step is done. Be sure to perform the work on both production as well as test farm if required.
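When watching a test job, a per-state summary of the qstat output can be easier to read than the raw list. This is a sketch that assumes the usual qstat layout of two header lines with the state (qw, t, r) in the fifth column; the function name job_state_counts is hypothetical.

```shell
# Read qstat output on stdin and print "<state> <count>" for each job state.
job_state_counts() {
    awk 'NR > 2 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

# Intended use as the user who submitted the job:
#   qstat | job_state_counts
```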

CLI Commands
Job management
qstat -j <job id>    Show detail information on job (shows why it didn't run if in error)
qmod -cj <job id>    Clear job error state
qdel <job id>        Delete a job

User management
qconf -sm                  Show manager list
qconf -am user[,...]       Add user to manager list
qconf -dm user[,...]       Remove user from managers list
qconf -so                  Show operators list
qconf -ao user[,...]       Add user to operator list
qconf -do user[,...]       Delete user from operator list
qconf -suserl              Show defined users
qconf -suser user[,...]    Show definition of specified users


User access list management


qconf -sul                        Displays defined ACLs
qconf -su acl_name                Displays configuration of defined ACL
qconf -mu acl_name                Modify ACL
qconf -au user[,...] acl[,...]    Add users to ACLs
qconf -auser                      Adds user to list of registered users

Create ACL - Unknown (Add user to non-existent ACL then modify)
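The output of qconf -su wraps the entries field over several lines with trailing backslashes, which makes it awkward to grep for one user. The sketch below flattens it to one user per line. It assumes the layout shown in the bliusers example in the Howto and that an empty ACL prints NONE; the function names are hypothetical.

```shell
# Read "qconf -su <acl_name>" output on stdin and print one user per line.
acl_members() {
    awk '
        $1 == "entries" { collecting = 1; sub(/^entries[ \t]+/, "") }
        collecting {
            cont = ($0 ~ /\\[ \t]*$/)    # does the line end with a backslash?
            gsub(/[ \t\\]/, "")          # drop blanks and the backslash
            n = split($0, users, ",")
            for (i = 1; i <= n; i++)
                if (users[i] != "" && users[i] != "NONE") print users[i]
            if (!cont) collecting = 0
        }'
}

# Succeed if user $1 is a member of the ACL definition read from stdin.
acl_has_user() {
    acl_members | grep -Fxq -- "$1"
}

# Intended use from an admin host:
#   qconf -su bliusers | acl_has_user xxfu && echo "already in the ACL"
```

For adding a user without the editor, the table above already lists qconf -au user[,...] acl[,...].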

Host management
qconf -sh                    Display administrative host list
qconf -ah host[,...]         Add host to administrative host list
qconf -dh host[,...]         Delete host from administrative host list
qconf -sel                   Display execution host list
qconf -ae [host_template]    Add execution host
qconf -de host[,...]         Delete host from execution host list
qconf -ss                    Display submit host list
qconf -as host[,...]         Add host to submit host list
qconf -ds host[,...]         Delete host from submit host list

Host group management


qconf -shgrpl         Displays list of all host groups
qconf -shgrp group    Displays host group entries for <group>
qconf -ahgrp group    Adds new host group <group>
qconf -mhgrp group    Modifies existing host group <group>
qconf -dhgrp group    Deletes host group <group>

Queue management
qconf -aq <qname>    Add new queue
qconf -mq <qname>    Modify existing queue
qconf -sq <qname>    Display queue properties
qconf -sql           Display all queues

Queue complexes
qconf -sc Display the complex configuration

Project management
qconf -sprjl Show all defined projects


Configuration
qconf -sconf [host,...|global]    Show cluster configuration for hosts / globally
qconf -sconfl                     Displays hosts for which configurations are available
qconf -se host                    Display definition of execution host
qconf -sel                        Display execution host list
qconf -aconf host[,...]           Add host configuration entries
qconf -dconf host[,...]           Delete host configuration entries

Automation
qconf -Aconf file_list    Add host configurations from filenames (matching hostname)
qconf -Ae fname           Add execution host from file fname
qconf -Ahgrp fname        Add host group configuration from file fname
qconf -Aq fname           Add queue defined in file fname
qconf -Au fname           Add ACL from file fname
qconf -Auser fname        Add registered user defined in file fname
qconf -Msconf fname       Overwrite scheduler config from file fname

Other
qconf -secl          Display event client list
qconf -sep           Displays licensed processors per host & total
qconf -ssconf        Display scheduler configuration
qconf -sss           Display scheduler host
qconf -clearusage    Clear all user / project usage from sharetree

Installation
Configure server via standard build procedures
Mount gridengine shared directory (/opt/sge/test) (or /opt/sge/prod)
Add the following entries into /etc/services if not present
  sge_qmaster 1536/tcp
  sge_execd 1537/tcp
Add gridengine service account into /etc/passwd
  sgeadmin::153:153:GridEngine Service Account:/tmp:/bin/sh
Run pwconv
Lock password in /etc/shadow
  sgeadmin:*:13710:0:99999:7:::0
Add gridengine groups to /etc/group
  sge-25000::25000:
  sge-25001::25001:
  sge-25002::25002:
  sge-25003::25003:
  sge-25004::25004:
  sge-25005::25005:
  sge-25006::25006:
  sge-25007::25007:
  sge-25008::25008:
  sge-25009::25009:
  sge-25010::25010:
  sge-25011::25011:

  sge-25012::25012:
  sge-25013::25013:
  sge-25014::25014:
  sge-25015::25015:
  sge-25016::25016:
  sge-25017::25017:
  sge-25018::25018:
  sge-25019::25019:
  sge-25020::25020:
  sge-25021::25021:
  sge-25022::25022:
  sge-25023::25023:
  sge-25024::25024:
  sge-25025::25025:
  sge-25026::25026:
  sge-25027::25027:
  sge-25028::25028:
  sge-25029::25029:
  sge-25030::25030:
  sge-25031::25031:
  sge-25032::25032:
  sge-25033::25033:
  sge-25034::25034:
  sge-25035::25035:
  sge-25036::25036:
  sge-25037::25037:
  sge-25038::25038:
  sge-25039::25039:
  sge-25040::25040:
  sge-25041::25041:
  sge-25042::25042:
  sge-25043::25043:
  sge-25044::25044:
  sge-25045::25045:
  sge-25046::25046:
  sge-25047::25047:
  sge-25048::25048:
  sge-25049::25049:
  sge-25050::25050:
  sge-25051::25051:
  sge-25052::25052:
  sge-25053::25053:
  sge-25054::25054:
  sge-25055::25055:
  sge-25056::25056:
  sge-25057::25057:
  sge-25058::25058:
  sge-25059::25059:
  sge-25060::25060:
  sge-25061::25061:
  sge-25062::25062:
  sge-25063::25063:
  sge-25064::25064:
  sge-25065::25065:
  sge-25066::25066:
  sge-25067::25067:
  sge-25068::25068:
  sge-25069::25069:
  sge-25070::25070:
  sge-25071::25071:
  sge-25072::25072:
  sge-25073::25073:

  sge-25074::25074:
  sge-25075::25075:
  sge-25076::25076:
  sge-25077::25077:
  sge-25078::25078:
  sge-25079::25079:
  sge-25080::25080:
  sge-25081::25081:
  sge-25082::25082:
  sge-25083::25083:
  sge-25084::25084:
  sge-25085::25085:
  sge-25086::25086:
  sge-25087::25087:
  sge-25088::25088:
  sge-25089::25089:
  sge-25090::25090:
  sge-25091::25091:
  sge-25092::25092:
  sge-25093::25093:
  sge-25094::25094:
  sge-25095::25095:
  sge-25096::25096:
  sge-25097::25097:
  sge-25098::25098:
  sge-25099::25099:

Add all required nfs mounts to /etc/fstab, create mountpoints and mount

Configure qmaster (master node, e.g. sttge000)
  cd /opt/sge/test/
  ./install_qmaster
  Use account sgeadmin
  Use group range 25000-25099
  Use classic spooling

Configure execution hosts


  qconf -ah (nodename)
  cd /opt/sge/test
  ./install_execd (Defaults)
  Add queue instance
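The /etc/group entries for the group range 25000-25099 listed in the installation steps follow a fixed pattern, so they can be generated instead of typed. A minimal sketch; the function name sge_group_entries is hypothetical, and the output is meant to be reviewed before it is appended to /etc/group.

```shell
# Print /etc/group lines of the form "sge-<gid>::<gid>:" for a GID range.
# Defaults to the 25000-25099 range used in the installation steps.
sge_group_entries() {
    first=${1:-25000}
    last=${2:-25099}
    gid=$first
    while [ "$gid" -le "$last" ]; do
        printf 'sge-%d::%d:\n' "$gid" "$gid"
        gid=$((gid + 1))
    done
}

# Intended use as root, after checking the output:
#   sge_group_entries >> /etc/group
```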

export
qconf -sm > SGE.MANAGERS
qconf -so > SGE.OPERATORS
qconf -suserl > SGE.REGISTERED_USERS
for USR in `qconf -suserl` ; do qconf -suser $USR > REGISTERED_USER.$USR ; done
qconf -sul > SGE.ACLS
for ACL in `qconf -sul` ; do echo $ACL ; qconf -su $ACL > ACL.$ACL ; done
for PRJ in `qconf -sprjl` ; do echo $PRJ ; qconf -sprj $PRJ > PROJECT.$PRJ ; done
qconf -sh > SGE.ADMINISTRATIVE_HOST_LIST
qconf -sel > SGE.EXECUTION_HOST_LIST
qconf -ss > SGE.SUBMIT_HOST_LIST
qconf -shgrpl > SGE.HOST_GROUPLIST

for HGRP in `qconf -shgrpl` ; do echo $HGRP ; qconf -shgrp $HGRP > HOSTGROUP.$HGRP ; done
qconf -sql > SGE.QUEUES
for QUE in `qconf -sql` ; do echo $QUE ; qconf -sq $QUE > QUEUE.$QUE ; done
qconf -sc > SGE.COMPLEXES
qconf -sprjl > SGE.DEFINED_PROJECTS
qconf -sconf global > SGE.GLOBAL_CONFIG
for HCNF in `qconf -sconfl` ; do echo $HCNF ; qconf -sconf $HCNF > HOST_CONFIG.$HCNF ; done
qconf -sel > SGE.EXECUTION_HOSTS
qconf -sep > SGE.LICENSED_PROCESSORS
qconf -ssconf > SGE.SCHEDULER_CONFIG
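The per-object loops in the export all have the same shape: list the names, then dump each one into PREFIX.name. The sketch below factors that into one function. The name export_objects is hypothetical, and it assumes each list command prints one name per line.

```shell
# Dump every object named by a qconf list option into <prefix>.<name> files
# in the current directory.
# $1 = file prefix, $2 = qconf list option, $3 = qconf show option
export_objects() {
    prefix=$1
    list_opt=$2
    show_opt=$3
    qconf "$list_opt" | while read -r name; do
        [ -n "$name" ] || continue
        echo "$name"
        qconf "$show_opt" "$name" > "$prefix.$name"
    done
}

# Intended use, mirroring the loops above:
#   export_objects ACL       -sul    -su
#   export_objects PROJECT   -sprjl  -sprj
#   export_objects HOSTGROUP -shgrpl -shgrp
#   export_objects QUEUE     -sql    -sq
```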

import
Order
1. Host groups
2. Registered users
3. ACLs
4. Queues
5. Managers

for X in `ls | cut -f1 -d. | sort -u` ; do echo $X ; mkdir $X ; for FL in `ls $X.*` ; do mv $FL $

1. Replace hostgroups with new hosts

cd HOSTGROUP
mkdir ../HOSTGROUP-new
for X in `ls` ; do printf "group_name $X\nhostlist sttge000.monsanto.com sttge001.monsanto.com st
cd ../HOSTGROUP-new
for X in `ls` ; do echo $X ; qconf -Ahgrp $X ; done

2. Add user ACLs


cd ACL
for x in `ls` ; do echo $x ; qconf -Au $x ; done

3. Import projects
cd PROJECT
for x in `ls` ; do echo $x ; qconf -Aprj $x ; done

4. Add registered users if desired (must exist on system)
cd REGISTERED_USER
for x in `ls` ; do echo $x ; qconf -Auser $x ; done

5. Add host configuration
cd HOST_CONFIG
for x in `ls` ; do echo $x ; qconf -Aconf $x ; done

6. Add queue definitions
cd QUEUE
for x in `ls` ; do echo $x ; qconf -Aq $x ; done

7. Add administrators/operators
cd SGE
for X in `cat MANAGERS` ; do echo $X ; qconf -am $X ; done
for X in `cat OPERATORS` ; do echo $X ; qconf -ao $X ; done
qconf -ssconf > ../backup/BACKUP.SCHEDULER_CONFIG
qconf -Msconf SCHEDULER_CONFIG
qconf -sconf global > ../backup/BACKUP_GLOBAL_CONFIG
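For import step 1, each replacement host group file needs a group_name line and a hostlist line before it is passed to qconf -Ahgrp. The sketch below writes such a definition for an arbitrary host list. The function name make_hostgroup_def and the group and host names in the usage comment are hypothetical, and writing NONE for an empty host list is an assumption about how gridengine represents an empty group.

```shell
# Print a host group definition with the two fields used in import step 1.
# $1 = group name, remaining arguments = member hosts
make_hostgroup_def() {
    group=$1
    shift
    if [ "$#" -eq 0 ]; then
        hosts=NONE
    else
        hosts=$*
    fi
    printf 'group_name %s\nhostlist %s\n' "$group" "$hosts"
}

# Intended use:
#   make_hostgroup_def @somegroup host1.example.com host2.example.com > @somegroup
#   qconf -Ahgrp @somegroup
```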

global import? SHARETREE


[-astnode node_shares_list]    add sharetree node(s)
[-astree]                      create/modify the sharetree
[-Astree fname]                create/modify the sharetree from file
[-clearusage]                  clear all user/project sharetree usage
[-dstnode node_list]           delete sharetree node(s)
[-dstree]                      delete the sharetree
[-mstnode node_shares_list]    modify sharetree node(s)
[-Mstree fname]                modify/create the sharetree from file
[-mstree]                      modify/create the sharetree
[-shgrp_tree group]            show host group and used hostgroups as tree
[-sstnode node_list]           show sharetree node(s)
[-rsstnode node_list]          show sharetree node(s) and its children
[-sstree]                      show the sharetree

Create registered user imports from ACLs
for GRP in `qconf -sul | grep "users$"` ; do PRJ=`echo $GRP | sed 's/users//g'` ; mkdir $PRJ ; fo

Modify default project all users
for X in `qconf -suserl` ; do echo $X ; qconf -suser $X | grep -v "^default_project" > out/$X ; GG=`qconf -suser $X | grep "^default_project" | awk '{print $2}'` ; echo "default_project $GG-proj" >> out/$X ; done

Delete project jobs out of userid submitted
for X in `qstat -u jafish | egrep -v "^job-ID|---------" | awk '{print $1}'` ; do PRJ=`qstat -j $
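The default-project rewrite earlier in this section calls qconf -suser twice per user and appends the new line with echo. As an alternative sketch, a single awk filter can replace the default_project line in place. The function name set_default_project is hypothetical; the out/ directory and the -proj suffix in the usage comment come from the loop above.

```shell
# Read a "qconf -suser" definition on stdin and print it with the
# default_project line replaced by the project given as $1. If the
# definition has no default_project line, one is appended.
set_default_project() {
    awk -v project="$1" '
        $1 == "default_project" { print "default_project " project; seen = 1; next }
        { print }
        END { if (!seen) print "default_project " project }'
}

# Intended use for one user, given that user's current project in $GG:
#   qconf -suser "$X" | set_default_project "$GG-proj" > "out/$X"
```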
