Queue System


Overview

This document describes the Sun Grid Engine (SGE) 6.0 queue system on the MSU Chemistry Department Linux cluster. It provides only a brief introduction to help users get started with SGE. For more details, consult the appropriate man pages. The sge_intro man page (type "man sge_intro") gives a brief description of all of the SGE commands.


Queue Structure

There is a single cluster queue that will accept submitted jobs and route them to available compute nodes. Because there is only one queue, there is no need to specify a queue when submitting jobs. There are no CPU, memory, disk or time limits. However, the queue system will only schedule jobs to run on processors that are idle. If there are not enough free processors to run a submitted job, the job will wait in the queue until enough free processors become available.


Submitting a Job

Jobs are submitted to SGE using the qsub command. Qsub accepts a shell script which contains the commands to be executed when the job runs. You can also instruct qsub to modify the characteristics of the job by embedding switches in the script or by placing them on the command line. There are many switches available; see the qsub man page for details.

A script file can be as simple as a single line of text containing the command to run. Here is an example script file:

  a.out <file.input >file.output

This job could be submitted to the default queue with the following command (assume the script file is named "scriptfile"):

  qsub scriptfile

Note that SGE starts a new login session for your script. One implication is that your script will have its working directory set to your home directory. If your script needs to run in a different directory, add the appropriate "cd" command to the script. Alternatively, use the "-cwd" qsub option to start the script in the directory from which you submitted the job instead of your home directory.
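
As a sketch of embedding switches in the script itself: qsub treats script lines that begin with "#$" as options. The job name and log file name below are made up for illustration, and "./a.out" assumes the program sits in the submission directory. The snippet only writes the script file; submitting it is left to you.

```shell
# Write a job script whose "#$" lines carry qsub switches.
# example_job and example_job.log are illustrative names, not site defaults.
cat > scriptfile <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N example_job
#$ -j y
#$ -o example_job.log
./a.out <file.input >file.output
EOF

# Submit it with:  qsub scriptfile
```

Here "-S" names the shell that runs the script, "-cwd" starts it in the submission directory, "-N" sets the job name, and "-j y" with "-o" merges standard error into a single log file.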

Jobs can also be submitted interactively with qsub. When no script file is given, qsub reads the script from standard input. For example, instead of putting "a.out ..." in a file and then submitting that file, you could type the job in directly as follows:

  qsub <ENTER>
  a.out <file.input >file.output <ENTER>
  <CONTROL-D>
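
Because qsub reads the script from standard input in this mode, the same one-line job can also be piped in without any typing at the prompt. This is a minimal sketch; the guard around qsub is only there so the snippet can be tried on a machine that does not have SGE installed.

```shell
# Feed the one-line script to qsub on standard input.
job='a.out <file.input >file.output'
if command -v qsub >/dev/null 2>&1; then
    printf '%s\n' "$job" | qsub
else
    # No SGE here: show what would have been submitted.
    printf 'qsub not found; would submit: %s\n' "$job"
fi
```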

Using the Big Node

The compute node named "compute-4d-0-0" has more memory and scratch disk space than the other nodes. To use this special node, add "-l bignode" to the qsub command. For example:

  qsub -l bignode scriptfile

Submitting Parallel Jobs

SGE uses parallel environments to control the execution of parallel jobs. A parallel environment, or PE, is a collection of settings that is configured by the system administrator. These settings define parameters such as how to allocate nodes and the processors within those nodes. Several PEs are defined on hydra, but most users will need only two: mpich and g03.

Programs that use MPI, such as AMBER, should use the mpich PE. This will allow the queue system to allocate any available processors across all compute nodes. To use the mpich PE, add "-pe mpich n" to the qsub command, where n is the number of processors you wish to request. For example, to submit a job that will use 8 processors, type:

  qsub -pe mpich 8 scriptfile

Note that these 8 processors could be allocated as 4 processors each on 2 nodes, or 2 processors each on 4 nodes, or in any other combination that sums to 8.
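
To see how the processors were actually divided, a job can inspect the file named by the PE_HOSTFILE environment variable, which SGE sets for parallel jobs. Each line gives a node, the number of slots granted on it, a queue name, and a processor range. The sketch below summarizes that file. The node and queue names in the fallback sample are made up so the snippet can be tried outside a job.

```shell
# Outside a job, fall back to a made-up sample allocation (hypothetical names).
if [ -z "${PE_HOSTFILE:-}" ]; then
    PE_HOSTFILE=$(mktemp)
    cat > "$PE_HOSTFILE" <<'EOF'
compute-0-1.local 4 all.q@compute-0-1.local UNDEFINED
compute-0-2.local 2 all.q@compute-0-2.local UNDEFINED
compute-0-3.local 2 all.q@compute-0-3.local UNDEFINED
EOF
fi

# Column 1 is the node, column 2 the number of slots granted on it.
awk '{ printf "%s: %d slots\n", $1, $2; total += $2 }
     END { printf "total: %d slots\n", total }' "$PE_HOSTFILE"
```

Inside a running job, SGE also sets NSLOTS to the total number of slots granted, which is the value normally passed to the MPI launcher.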

For shared memory programs like Gaussian 03, use the g03 PE. This places all of the allocated processors on the same node. Since each node has only 4 processor cores, do not request more than 4 processors. If you do, the request can never be satisfied and your job will wait in the queue indefinitely. To submit a shared memory 4 processor job, type:

  qsub -pe g03 4 scriptfile

Submitting GAMESS Jobs

Use the following command to submit a GAMESS job:

  gmssub [-b basisfile] [-m email] [-n ncpus] [qsub_args] file_name

where "file_name" is the name of your input file without the ".inp" extension. The optional qsub_args will be passed to SGE. If an email address is given, the output file will be sent to that address upon completion of the job. GAMESS can run in parallel on the cluster, and you must specify the number of processors for your job with "-n ncpus". If you do not wish to run your job in parallel, specify 1 processor.


Submitting Gaussian 03 Jobs

Use the following command to submit a Gaussian 03 job:

  g03sub [-m email] [qsub_args] file_name

where "file_name" is the name of your input file. The optional qsub_args will be passed to SGE. If an email address is given, the output file will be sent to that address upon completion of the job.

Note for parallel use: the g03sub command will look inside your input file for the %NProc= line, and it will automatically add the correct qsub options for a parallel job. You do NOT have to use the "-pe g03" option with g03sub.
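
The lookup described above can be pictured with a short sketch. This is not the actual g03sub code, only an illustration of reading the processor count from an input file. The file name example.com and its contents are made up.

```shell
# Made-up Gaussian input header, for illustration only.
cat > example.com <<'EOF'
%Mem=1GB
%NProc=4
# HF/6-31G(d)
EOF

# Take the value after "=" on the first %NProc line; default to 1 if absent.
nproc=$(grep -i '^%nproc' example.com | head -n 1 | cut -d= -f2 | tr -d '[:space:]')
nproc=${nproc:-1}

# Show the qsub options such a wrapper might add.
echo "qsub -pe g03 $nproc ..."
```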


Submitting Molpro Jobs

Use the following command to submit a Molpro 2006.01 job:

  m06sub [-n ncpus] file_name

where "file_name" is the name of your input file. To run the job in parallel, add "-n ncpus" to the command, where ncpus is the number of CPUs to use. For example, to use 4 CPUs, type the following:

  m06sub -n 4 file_name

Interactive Jobs

SGE allows interactive programs to be run in the queue system. To start an interactive job, type:

  qlogin

You should see some messages similar to:

  waiting for interactive job to be scheduled ...
  Your interactive job 1564 has been successfully scheduled.

You should then be logged in to a compute node. At this point, your shell and every command you run will be executed under the control of the queue system. When you log out of this shell, your queue session will end.


Getting the Status of a Job

To check the status of a job, use the qstat command. For example:

  qstat

will list all jobs on hydra, both running and waiting. The "qstat -f" command gives another useful view of the queues. It shows the execution queue on each node, along with the job(s) that are running on it and how many processors are being used.
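
Two other standard qstat options are often handy: "-u" restricts the list to one user's jobs, and "-j" prints the details of a single job. The sketch below wraps them in small shell functions. The function names my_jobs and job_details are hypothetical conveniences, not part of SGE.

```shell
# List only your own jobs.
my_jobs() { qstat -u "$(id -un)"; }

# Show the full details of one job, given its job ID.
job_details() { qstat -j "$1"; }
```

For example, after defining these, "job_details 1564" would show the details of job 1564.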


Deleting a Job

The qdel command is used to delete a job from a queue. First get the job ID number by using the qstat command, then type qdel followed by the job ID.

For example, to delete job number 28, type:

  qdel 28

If you delete a running job, the queue system should kill all of the processes related to that job. However, the queue system cannot monitor certain kinds of parallel jobs. To completely kill a parallel job, first find out which nodes your job is running on ("qstat -f") BEFORE deleting it, then use qdel to delete the job. Finally, log in to each of the nodes your job was running on and use the "ps" and "kill" commands to find and kill any of your remaining processes.
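
The cleanup steps above can be laid out as a short plan. The sketch below only prints the commands rather than running them, so nothing is deleted or killed by accident. It assumes the compute nodes can be reached with ssh. The helper name cleanup_plan and the node names in the example call are made up.

```shell
# Print (not run) the commands for cleaning up a parallel job.
# Usage: cleanup_plan JOB_ID NODE...
cleanup_plan() {
    job_id=$1
    shift
    echo "qdel $job_id"
    for node in "$@"; do
        # Lists your processes on the node; kill leftover PIDs by hand.
        echo "ssh $node ps -u $(id -un) -o pid,etime,args"
    done
}

# Use the node names that "qstat -f" reported for your job.
cleanup_plan 28 compute-0-1 compute-0-2
```

After reviewing the ps output on each node, use "kill" with the process IDs that belong to the deleted job.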