Connecting to Hydra
Users may connect to hydra.cem.msu.edu (the head node) with secure shell, or ssh. Files can be transferred to/from hydra with sftp or scp. Only the head node will accept connections from the outside world. The compute nodes are on a separate, private network and will only accept connections from the head node.
Mac OS X users can use the OS X command line versions of ssh and sftp/scp. A graphical version called Fugu is available on the web.
Windows ssh/sftp clients include Secure Shell Client and PuTTY.
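As a rough sketch of command-line use, the helper functions below wrap ssh and scp for the head node. The function names and the HYDRA_USER variable are hypothetical conveniences, and "your_username" is a placeholder for your own account name.

```shell
# Placeholder account name; set HYDRA_USER to your own hydra login.
HYDRA_USER="${HYDRA_USER:-your_username}"
HYDRA_HOST="hydra.cem.msu.edu"

# Open an interactive shell on the head node.
hydra_login() {
    ssh "${HYDRA_USER}@${HYDRA_HOST}"
}

# Copy a local file into your home directory on hydra.
hydra_put() {
    scp "$1" "${HYDRA_USER}@${HYDRA_HOST}:"
}

# Copy a file from your home directory on hydra to the current directory.
hydra_get() {
    scp "${HYDRA_USER}@${HYDRA_HOST}:$1" .
}
```

For example, "hydra_put file.input" would copy file.input to your home directory, and "hydra_get file.output" would bring a result back.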
Node Naming Convention
- Head node
The head node is named hydra.cem.msu.edu to the outside world. The compute nodes know it as hydra.local on the internal network.
- Regular (2-disk) compute nodes
The 2-disk compute nodes are named compute-2d-<rack#>-<slot#>, where the rack and slot numbers start at 0 and go up. There is only one rack (number 0). There are 11 regular compute nodes, so they are named compute-2d-0-0 through compute-2d-0-10. These nodes also have a shorter nickname: c2d0-0 through c2d0-10.
- Big (4-disk) compute node
The 4-disk compute nodes are named compute-4d-<rack#>-<slot#>, where the rack and slot numbers start at 0 and go up. There is currently only one node of this type in the cluster, so it is named compute-4d-0-0 or c4d0-0.
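The naming rule can be illustrated with a short shell sketch that prints the full name and the nickname of each node. The function name node_names is hypothetical; it only formats strings and does not contact any node.

```shell
# node_names TYPE RACK SLOT
# Print the full name and the nickname of a compute node.
# TYPE is 2d or 4d; RACK and SLOT are numbers starting at 0.
node_names() {
    printf 'compute-%s-%s-%s c%s%s-%s\n' "$1" "$2" "$3" "$1" "$2" "$3"
}

# The 11 regular nodes in rack 0 occupy slots 0 through 10.
slot=0
while [ "$slot" -le 10 ]; do
    node_names 2d 0 "$slot"
    slot=$((slot + 1))
done

# The single big node.
node_names 4d 0 0
```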
Hardware
Hydra is a 13-node cluster from Western Scientific consisting of one head node, 11 regular compute nodes, and one "big" compute node with extra memory and disk.
Head Node
- two dual-core AMD Opteron 265 processors, 1.8GHz, 2MB cache
- 4GB RAM
- two 80GB 7200 RPM SATA disks (OS), RAID 1
- two 400GB 7200 RPM SATA disks (apps and home directories), RAID 1
- ARECA SATA RAID controller
Regular Compute Nodes
- two dual-core AMD Opteron 265 processors, 1.8GHz, 2MB cache
- 4GB RAM
- two 250GB 7200 RPM SATA disks
Big Compute Node
- two dual-core AMD Opteron 265 processors, 1.8GHz, 2MB cache
- 16GB RAM
- four 250GB 7200 RPM SATA disks
- ARECA SATA RAID controller
Software
The software listed below is installed on hydra. Click on the program names for instructions on how to run the programs.
- ADF2007.01 -- Amsterdam Density Functional 2007.01 package
- AMBER 8 -- molecular dynamics package
- GAMESS -- ab initio and semi-empirical quantum chemistry package
- Gaussian 03 -- electronic structure calculation package with density functional theory capabilities
- GaussView 3.0 -- graphical user interface for Gaussian 03
- MOLPRO 2006.1 -- ab initio programs for molecular electronic structure calculations
- Intel 9.1 compilers -- C++ and Fortran compilers
- Portland Group 6.2 compilers -- C, C++ and Fortran compilers
File Systems
Head Node
- /export
The /export file system contains application software and user home directories. It is hardware RAID 1 (mirror) for redundancy, and it is about 384GB in size. Quotas are enabled to limit users to a fixed amount of space. Because it is NFS mounted to the compute nodes and reads/writes go over the network, it should not be used for computational scratch space.
Regular Compute Nodes
- /scratch
/scratch is a 2-disk software RAID 0 (striping), about 434GB in size. It is faster than a single disk, and it is local storage on each node. This is where compute jobs should put scratch files. The queue system will automatically create a temporary directory in /scratch for each job and set the $TMP environment variable to the name of the directory. When the job completes, the queue system will automatically remove the temporary directory.
Whenever possible, the existing job submit scripts on hydra (gmssub, g03sub, etc.) will take the necessary steps to use the temporary directories in /scratch. If you write your own submit scripts, you should do this yourself.
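If you write your own submit script, the steps might look like the following sketch. It assumes the input and output are plain file names in the directory where the job starts. The helper name run_in_scratch is hypothetical, and the fallback to mktemp when $TMP is unset is an addition so the function can be tried outside the queue; it is not part of the hydra submit scripts.

```shell
# run_in_scratch INPUT OUTPUT COMMAND [ARGS...]
# Copy INPUT to the job's scratch directory, run COMMAND there with INPUT
# on standard input and OUTPUT on standard output, then copy OUTPUT back
# to the directory the function was called from.
# The body runs in a subshell, so the caller's directory is unchanged.
run_in_scratch() (
    input=$1
    output=$2
    shift 2
    start_dir=$(pwd)

    # Inside a job, the queue system sets $TMP to a directory in /scratch.
    # Outside the queue, fall back to a fresh temporary directory.
    scratch="${TMP:-$(mktemp -d)}"

    cp "$input" "$scratch/" || exit 1
    cd "$scratch" || exit 1

    "$@" <"$input" >"$output"
    status=$?

    # Copy the result back before the queue system removes the directory.
    cp "$output" "$start_dir/$output"
    exit "$status"
)
```

A script could then contain a line such as "run_in_scratch file.input file.output ./a.out", provided a.out is reachable from the scratch directory (for example, by giving its full path).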
Big Compute Node
- /scratch
On the "big" compute node, /scratch is a 4-disk hardware RAID 0 of about 893GB. Otherwise, it functions just like /scratch on the regular nodes.
Queue System
Contents:
- Overview
- Queue Structure
- Submitting a Job
- Using the Big Node
- Submitting Parallel Jobs
- Submitting GAMESS Jobs
- Submitting Gaussian 03 Jobs
- Submitting Molpro Jobs
- Interactive Jobs
- Getting the Status of a Job
- Deleting a Job
Overview
This document describes the Sun Grid Engine 6.0 queue system on the MSU Chemistry Department Linux cluster. It provides only a brief introduction to help users get started using SGE. For more details, consult the appropriate man pages. The sge_intro man page (type "man sge_intro") gives a brief description of all of the SGE commands.
Queue Structure
There is a single cluster queue that will accept submitted jobs and route them to available compute nodes. Because there is only one queue, there is no need to specify a queue when submitting jobs. There are no CPU, memory, disk or time limits. However, the queue system will only schedule jobs to run on processors that are idle. If there are not enough free processors to run a submitted job, the job will wait in the queue until enough free processors become available.
Submitting a Job
Jobs are submitted to SGE using the qsub command. The qsub command accepts a shell script which contains the commands to be executed when the job runs. You can also instruct qsub to modify the characteristics of the job by embedding switches in the script or by placing them on the command line. There are many switches available; see the qsub man page for details.
A script file can be as simple as a single line of text containing the command to run. Here is an example script file:
a.out <file.input >file.output
This job could be submitted to the default queue with the following command (assume the script file is named "scriptfile"):
qsub scriptfile
It is important to note that SGE will start a new login session for your script. One implication of this is that your script will have its working directory set to your home directory. If your script needs to be in a different directory, you will need to add the appropriate "cd" command to your script. You could also use the "-cwd" qsub option to have it start your script from the current working directory instead of your home directory.
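As a sketch of embedding switches in the script itself, the example below uses the standard "#$" prefix that qsub recognizes. The job name example_job is hypothetical, and the guard around a.out is an addition so that the script does nothing harmful when the program or its input is missing.

```shell
#!/bin/sh
# Lines that begin with "#$" are read by qsub as switches; the shell
# treats them as ordinary comments.
# Start the job in the directory where qsub was run:
#$ -cwd
# Name the job (hypothetical name):
#$ -N example_job
# Merge standard error into standard output:
#$ -j y

job_dir=$(pwd)
echo "Job started in $job_dir on $(uname -n)"

# Run the program from the earlier example if it is present here.
if [ -x ./a.out ] && [ -r file.input ]; then
    ./a.out <file.input >file.output
fi
```

With the switches embedded this way, the script can be submitted with a plain "qsub scriptfile".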
Jobs can also be submitted interactively with qsub. For example, instead of putting "a.out ..." in a file and then submitting that file, the job could be submitted interactively as follows:
qsub <ENTER>
a.out <file.input >file.output <ENTER>
<CONTROL-D>
Using the Big Node
The compute node named "compute-4d-0-0" has more memory and scratch disk space than the other nodes. To use this special node, add "-l bignode" to the qsub command. For example:
qsub -l bignode scriptfile
Submitting Parallel Jobs
SGE uses parallel environments to control the execution of parallel jobs. A parallel environment, or PE, is a collection of settings that is configured by the system administrator. These settings define parameters such as how to allocate nodes and the processors within those nodes. Several PEs are defined on hydra, but most users will need only two: mpich and g03.
Programs that use MPI, such as AMBER, should use the mpich PE. This will allow the queue system to allocate any available processors across all compute nodes. To use the mpich PE, add "-pe mpich n" to the qsub command, where n is the number of processors you wish to request. For example, to submit a job that will use 8 processors, type:
qsub -pe mpich 8 scriptfile
Note that these 8 processors could be allocated as 4 processors each on 2 nodes, or 2 processors each on 4 nodes, or in any other combination that sums to 8.
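To see how the processors were actually distributed, a job script can read the host file that SGE provides to parallel jobs. The sketch below assumes the usual SGE behavior of setting $NSLOTS and $PE_HOSTFILE inside a parallel job, with the host name and slot count in the first two columns of the host file. The function name summarize_slots is hypothetical, and the MPI launch line is site specific and is not shown.

```shell
#!/bin/sh
#$ -cwd
#$ -pe mpich 8

# summarize_slots HOSTFILE
# Print "host slots" for each line of an SGE PE host file, followed by
# the total number of slots granted.
summarize_slots() {
    awk '{ print $1, $2; total += $2 } END { print "total", total }' "$1"
}

# NSLOTS and PE_HOSTFILE are set by SGE inside a parallel job. Outside
# the queue they are unset, and this section is skipped.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    echo "Granted $NSLOTS slots:"
    summarize_slots "$PE_HOSTFILE"
fi
```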
For shared memory programs like Gaussian 03, use the g03 PE. This will cause all of the allocated processors to be on the same node. Since each node has only 4 processor cores, you should not request more than 4 processors. If you do, your job will just sit in the queue and wait forever. To submit a shared memory 4 processor job, type:
qsub -pe g03 4 scriptfile
Submitting GAMESS Jobs
Use the following command to submit a GAMESS job:
gmssub [-b basisfile] [-m email] [-n ncpus] [qsub_args] file_name
where "file_name" is the name of your input file without the ".inp" extension. The optional qsub_args will be passed to SGE. If an email address is given, the output file will be sent to that address upon completion of the job. GAMESS can run in parallel on the cluster, and you must specify the number of processors to use for your job. If you do not wish to run your job in parallel, specify 1 processor.
Submitting Gaussian 03 Jobs
Use the following command to submit a Gaussian 03 job:
g03sub [-m email] [qsub_args] file_name
where "file_name" is the name of your input file. The optional qsub_args will be passed to SGE. If an email address is given, the output file will be sent to that address upon completion of the job.
Note for parallel use: the g03sub command will look inside your input file for the %NProc= line, and it will automatically add the correct qsub options for a parallel job. You do NOT have to use the "-pe g03" option with g03sub.
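The following sketch shows one way a wrapper could turn the %NProc= line into qsub options. It is an illustration only and is not the actual g03sub code; the function names nproc_from_input and pe_args are hypothetical. The 4-processor limit reflects the number of cores per node.

```shell
# nproc_from_input FILE
# Print the value of the first %NProc= line in a Gaussian input file,
# or 1 if the file has none. Matching is case-insensitive.
nproc_from_input() {
    n=$(grep -i '^%nproc=' "$1" | head -n 1 | cut -d= -f2 | tr -d ' \r')
    echo "${n:-1}"
}

# pe_args FILE
# Print the qsub switches for a shared memory run, or nothing for 1 CPU.
# Fail if more processors are requested than a single node provides.
pe_args() {
    n=$(nproc_from_input "$1")
    if [ "$n" -gt 4 ]; then
        echo "error: $n processors requested, but a node has only 4 cores" >&2
        return 1
    fi
    if [ "$n" -gt 1 ]; then
        echo "-pe g03 $n"
    fi
}
```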
Submitting Molpro Jobs
Use the following command to submit a Molpro 2006.01 job:
m06sub [-n ncpus] file_name
where "file_name" is the name of your input file. To run the job in parallel, add "-n ncpus" to the command, where ncpus is the number of CPUs to use. For example, to use 4 CPUs, type the following:
m06sub -n 4 file_name
Interactive Jobs
SGE allows interactive programs to be run in the queue system. To start an interactive job, type:
qlogin
You should see some messages similar to:
waiting for interactive job to be scheduled ...
Your interactive job 1564 has been successfully scheduled.
Then you should be logged in to a compute node. At this point, your shell and every command you run will be executed under control of the queue system. When you log out of this shell, your queue session will end.
Getting the Status of a Job
To check the status of a job, use the qstat command. For example:
qstat
will list all jobs running on hydra. The "qstat -f" command gives another useful view of the queues. It shows the execution queue on each node, along with the job(s) that are running on it and how many processors are being used.
Deleting a Job
The qdel command is used to delete a job from a queue. First get the job ID number by using the qstat command, then type qdel followed by the job ID.
For example, to delete job number 28, type:
qdel 28
If you delete a running job, the queue system should kill all of the processes related to that job. However, the queue system cannot monitor certain kinds of parallel jobs. If you want to completely kill a parallel job, you should find out which nodes your job is running on ("qstat -f") BEFORE deleting it, then use qdel to delete the job. Finally, log in to each of the nodes your job was running on and use the "ps" and "kill" commands to find and kill any of your remaining processes.
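The last step can be scripted from the head node. The sketch below lists your remaining processes on each node you name. The function name cleanup_nodes and the DRY_RUN variable are hypothetical; with DRY_RUN=1 the ssh commands are printed instead of executed, which is a convenient way to check the node list first.

```shell
# cleanup_nodes NODE...
# Show your remaining processes on each node (PID, elapsed time, command).
# With DRY_RUN=1, print the ssh commands instead of running them.
cleanup_nodes() {
    me="${USER:-$(id -un)}"
    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh $node ps -u $me -o pid,etime,comm"
        else
            ssh "$node" ps -u "$me" -o pid,etime,comm
        fi
    done
}
```

For example, "cleanup_nodes c2d0-3 c2d0-7" would list your processes on those two nodes; a leftover process could then be removed with a command such as "ssh c2d0-3 kill 12345", where 12345 stands for the PID reported by ps.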

