
Quick links to examples:

  • Gaussian
  • Charmm
  • Amber
  • Gamess
  • Aces2 
  • Dock
  • ZDock
  • QChem
  • QChem with Charmm
  • Perl
  • NWChem
  • How to use the queues on the cluster.

    The principle behind the batch queue system is that the user prepares a submission script for their job which identifies the resources they need; here we will worry only about cpu time. Vanguard will then read the submission script and attempt to locate a free cpu for you. If none are available it will hold your job in a queue and start it automatically as soon as a cpu becomes free. Jobs should only need to be submitted once and will run as soon as the resources are available for them. Jobs started interactively on the cluster (e.g. you do an rsh albion then g03 myfile) will only run for 5 minutes before stopping. The batch management is via Torque (based on OpenPBS) and Maui software and is managed by vanguard. Please don't run jobs on vanguard, or even processes there (such as molden etc.), as vanguard is an extremely busy machine.

    As of 1st Oct 2011 the batch system is running the queues below:

    Queue name   Length    Nodes in queue / memory per machine / cores per node   Default # of nodes / default # cores per node   Max no. of running jobs per user at once
    ----------   -------   ----------------------------------------------------   ---------------------------------------------   ----------------------------------------
    parallel     28 days   51 / 4Gb / 4                                           1 / 1                                           16
    quad         28 days   30 / 8Gb / 4                                           1 / 4                                           8
    octa         28 days   8 / 16Gb / 8                                           1 / 8                                           4
    bigmem       28 days   8 / 16Gb / 8                                           1 / 8                                           4
    dellr610     28 days   12 / 32Gb / 12                                         1 / 12                                          6

    There is also an overall limit per user of 24 running jobs and 128 processors at any one time. Jobs that are queued accumulate a priority score based on how long they have been queued and how much cpu the user has used recently, so that the cluster is shared out equally between all users when under a high load. I.e. if you have not used the cluster a lot recently and your job has been queued as long as that of another user who has used lots of cpu recently, then your job should have a higher priority than theirs for starting next. When under a low load the cluster will attempt to run as many jobs as possible regardless of previous use by a user. All these numbers can be changed if they do not allow efficient use of the cluster.

    All jobs wanting to run on the cluster must be submitted to one of the above queues using a submission script, examples of which are given below. If all the nodes for a queue are in use then the cluster will automatically assign unused nodes from other queues to the 'full' queue. All input files and output files will need to be referenced in the submission script to make them available for the job run and to return the results back to you at the end.

    Note that for all queues except parallel the default number of processors is equal to the total on one node. For the parallel queue the default is 1 processor, allowing 4 jobs to run simultaneously on one node.

    How to submit a job, check its status and delete it from the batch queue system.

    Jobs are submitted on vanguard or leo or jordan using the command

    % qsub job.in
    (where job.in is the submission script), this will return a line like
    28.vanguard
    to tell you what your JOBID is
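
When submissions are scripted it can help to keep the JOBID that qsub prints. The following is a minimal sketch, assuming a Bourne-type shell; the function name jobid_number is made up for illustration, and the qsub call itself is left as a comment so the snippet also runs away from the cluster.

```shell
#!/bin/sh
# Hypothetical helper: strip the server suffix from a JOBID such as
# "28.vanguard", leaving only the numeric part.
jobid_number() {
    # ${1%%.*} removes everything from the first "." onwards
    printf '%s\n' "${1%%.*}"
}

# On the cluster the JOBID would come from qsub itself, e.g.
#   jobid=`qsub job.in`
# Here a sample value stands in for that output.
jobid="28.vanguard"
echo "full JOBID:  $jobid"
echo "number only: $(jobid_number "$jobid")"
```

The numeric part is what the checkjob example further down this page takes as its argument.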

    To find out how many jobs are in the system and running etc use:

    % qstat
    and it'll return something like
    28.vanguard gau_test jon 00:01:09 R long
    ....
    ....
    The R means running; a Q would mean queued but not running. It also displays the walltime that the job has taken (here 1min 9sec). Using qstat with different options may instead give cputime (which is affected by the number of cpus) or a number that is very low because more than one node is in use.
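
The state letter can also be picked out of the qstat listing by a script. This is only a sketch: job_state is a hypothetical helper, it assumes the column layout shown in the sample line above (state in the fifth field), and it reads sample text rather than calling qstat.

```shell
#!/bin/sh
# Hypothetical helper: read qstat-style lines on stdin and print the state
# column (fifth field) of the job whose ID begins with "<number>.".
job_state() {
    awk -v id="$1" 'index($1, id ".") == 1 { print $5 }'
}

# Sample text in the layout shown above; on the cluster this would be
#   qstat | job_state 28
sample='28.vanguard gau_test jon 00:01:09 R long
29.vanguard gau_test2 jon 0 Q long'

printf '%s\n' "$sample" | job_state 28
printf '%s\n' "$sample" | job_state 29
```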

    To list the status of all queues quickly use:

    % qstat -q
    This will return the time limits for each queue, the number of jobs running and the number queued.

    To list just a single queue e.g. the long queue use:

    % qstat long

    To delete a job in the queue use:

    % qdel 28.vanguard
    (or whatever your JOBID was)

    To find which node(s) your job is running on use:

    % qstat -n
    This will return a list of jobs and their hosts. You can then rsh to that machine and find your running output files in a directory like name.number, where name is your username and number is your JOBID number.

    To find out why your job is queued and not running use:

    % checkjob 28
    (or whatever your JOBID was). This will return the reason why your job is still queued and not running.

    To find some of your files after you have overrun the queue time limits

    If your job overruns the queue time limits the queue manager will kill it, which has the side-effect that the commands in your submission script which copy the output/restart files back to vanguard will not be run. However, the queue system automatically archives certain files from every job that is run and stores them in /home/rescue/archive on vanguard for 7 days, so check there and see if any of your files are available.

    Moving files to and from the cluster.

    All files need to be copied onto the cluster, and files you want to keep must be copied back at the end of the run. Regardless of which machine you submit from, the copying onto and off the cluster goes via vanguard.
    To copy a gaussian input file to the machine you are running on from the submission directory

    rcp vanguard:$PBS_O_WORKDIR/gau_input.com .
    To copy the output back at the end of the calculation:
    rcp gau_input.log vanguard:$PBS_O_WORKDIR/.

    PBS options for job submission scripts.

    It is recommended that all jobs are first submitted to the express queue to check that the input files are correct before being submitted to the longer queues.

    Options for the batch queue submission scripts are normally listed at the top of the script; each starts on a new line with the characters #PBS. Any other lines beginning with # are assumed to be comments. There are two principal ways to send an input file to the cluster: embed it within the submission script (e.g. gaussian), or prepare it beforehand and have the submission script copy it to the cluster at the right time.
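
A quick way to see which options a submission script actually sets is to list only its #PBS lines. A minimal sketch, assuming a Bourne-type shell; pbs_options is a made-up helper name and the demonstration script is a throwaway temporary file.

```shell
#!/bin/sh
# Hypothetical helper: print only the "#PBS" option lines of a submission
# script, skipping ordinary comments and commands.
pbs_options() {
    grep '^#PBS' "$1"
}

# Build a small throwaway script to demonstrate (its content is arbitrary).
demo=$(mktemp)
cat > "$demo" << 'EOF'
# Name of job
#PBS -N gau_test
# Queue to use
#PBS -q parallel
echo Running on host `hostname`
EOF

pbs_options "$demo"
rm -f "$demo"
```

Running it against a real script, e.g. pbs_options job.in, is a cheap check before submitting.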

    To choose a queue you would use

    #PBS -q parallel
    It is important to pick the right queue, because if all jobs are submitted to long then the cluster system cannot schedule the jobs efficiently.

    Naming your job.

    To name your job so you can tell jobs apart you can use something like
    #PBS -N gln248_free

    Control of the PBS error and output files.

    When you submit a job it is sent to the most appropriate machine to handle it. Any error messages and other output that would normally be sent to the screen if you ran the job interactively will be sent back to the directory you submitted from, in files called things like g03_c22n7o_opt.e28, where the first part is the name you assigned the job and the number part is the JOBID number.

    The names and locations of these files can be changed using something like:

    #PBS -o wat_mp2.out
    #PBS -e wat_mp2.err
    Which could be combined and simplified to
    #PBS -j eo
    #PBS -e wat_mp2.err

    $WORKDIR - temp working directories.

    A directory is automatically created and removed by the batch system, but the submission script must use that directory by issuing a cd $WORKDIR command; see the example below. Jobs should always be run in the $WORKDIR directory to prevent clashes with other users on the system. Don't forget to copy your files back to vanguard at the end of each job, otherwise the output will be lost.

    $PBS_O_WORKDIR - the submission directory on vanguard.

    Using $PBS_O_WORKDIR in a PBS script is a short way of referring to the directory where the submission script lives. It can be used to easily locate files on vanguard for copying to and from the cluster machines, e.g.
    # copy the input file from the directory the job was submitted from 
    # on vanguard to the $WORKDIR (we assume user has already done cd $WORKDIR)
    rcp vanguard:$PBS_O_WORKDIR/fred.in .

    # copy the output file to the same directory that we submitted from
    rcp fred.out vanguard:$PBS_O_WORKDIR/.

    Accessing the NFS file servers directly.

    While using vanguard (as in the examples below) works, it is faster to rcp directly to and from the nfs file server. If your home directory is on disks 1-5 the nfs server is Juno; if it is on disks 11-15 it will be Jaguar, i.e. you can replace vanguard in the line below

    rcp vanguard:$PBS_O_WORKDIR/fred.in .
    with

    rcp juno:$PBS_O_WORKDIR/fred.in .
    or
    rcp jaguar:$PBS_O_WORKDIR/fred.in .
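
The disk-to-server rule above can be written as a small lookup. This is a sketch only: nfs_server_for_disk is a hypothetical helper, the disk number is something you would have to supply yourself, and falling back to vanguard for any other disk is an assumption based on vanguard working in all the examples.

```shell
#!/bin/sh
# Hypothetical helper: map a home-disk number to the NFS server named in
# the text (disks 1-5 -> juno, disks 11-15 -> jaguar). Any other number
# falls back to vanguard (an assumption, see above).
nfs_server_for_disk() {
    case "$1" in
        [1-5])  echo juno ;;
        1[1-5]) echo jaguar ;;
        *)      echo vanguard ;;
    esac
}

# Example: a user whose home directory is on disk 12 (made-up value).
server=$(nfs_server_for_disk 12)
# Print the copy command rather than run it, so this is safe off-cluster.
echo "rcp $server:\$PBS_O_WORKDIR/fred.in ."
```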

    Example submission script for g03.

    # Name of job
    #PBS -N gau_test
    # combine std out and err and send them back to the
    # working directory
    #PBS -j eo
    #PBS -e gau_test.err
    # Queue to use
    #PBS -q parallel
    #PBS -l nodes=1:ppn=4

    # Gather some details about the host for reference later
    echo Running on host `hostname`
    echo Time is `date`

    # move into the temp directory for this job
    cd $WORKDIR
    echo Working directory is $WORKDIR

    # prepare the g03 input file
    # this next line means copy everything below until we see
    # EOF into a file called blyp.com
    cat > blyp.com << EOF
    % chk=blyp
    % mem=3500mb
    % nproc=4
    #P blyp/6-31+g* SCF=DIRECT maxdisk=1000000

    Water test case - v bad geometry

    0 1
    O
    H,1,r21
    H,1,r21,2,a312
    Variables:
    r21=0.99
    a312=104.5

    EOF
    # The EOF above marks the end of the g03 input file

    # Run the job
    g03 blyp

    # copy the output files back to the submission directory on vanguard
    rcp blyp.log vanguard:$PBS_O_WORKDIR/.

    This submits a job gau_test to the parallel queue on the cluster and returns the log file to the directory on vanguard that it was submitted from. To keep the checkpoint file as well, add a matching rcp line for blyp.chk. The standard out and standard error of the job will also be returned to the directory the job was submitted from.

    For parallel Gaussian add the lines

    #PBS -l nodes=1:ppn=X
    % nproc=X
    where X is 4 for the parallel and quad queues, 8 for the octa and bigmem queues and 12 for the dellr610a queue, and then submit.
    Quick guide to nproc and mem values for the queue system. The mem values are the largest values for each queue; if your job can run with smaller values then it will probably run much faster than with these values.
    Queue         %nproc   %mem
    -----------   ------   -------
    parallel      4        3500mb
    quad          4        6500mb
    octa/bigmem   8        14000mb
    dellr610      8        28000mb
    dellr610a     12       28000mb
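
The quick guide above can be turned into a lookup so that the % lines of a Gaussian input always match the queue. A minimal sketch, assuming a Bourne-type shell; gaussian_limits is a made-up name and the values are simply copied from the table.

```shell
#!/bin/sh
# Hypothetical helper: print the largest %nproc and %mem values from the
# table above for a given queue name.
gaussian_limits() {
    case "$1" in
        parallel)    echo "4 3500mb" ;;
        quad)        echo "4 6500mb" ;;
        octa|bigmem) echo "8 14000mb" ;;
        dellr610)    echo "8 28000mb" ;;
        dellr610a)   echo "12 28000mb" ;;
        *)           echo "unknown queue: $1" >&2; return 1 ;;
    esac
}

# Example: write the matching % lines for the quad queue.
set -- $(gaussian_limits quad)
echo "%mem=$2"
echo "%nproc=$1"
```

Remember that these are upper limits; smaller values may well run faster, as noted above.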


    Example Charmm submission script.

    To run a charmm job things are a little more complex because of the number of input and output files. Note also that we are actually going to run two charmm jobs from within one submission script; this is useful for running follow-on jobs with charmm or any other program:

    # Name of job

    #PBS -N dyn_test

    # names of std out and std err to be sent back to this directory

    #PBS -j eo
    #PBS -e charmm_test.err

    # Queue to use
    #PBS -q parallel
    #PBS -l nodes=1:ppn=1

    # Gather some details about the host for reference later
    echo Running on host `hostname`
    echo Time is `date`

    # move into the temp directory for this job
    cd $WORKDIR
    echo Working directory is $WORKDIR

    # copy the input and parameter files over
    # param files come from a parameter directory
    # other files from the current working directory
    rcp vanguard:params/top.inp .
    rcp vanguard:params/par.inp .
    rcp vanguard:$PBS_O_WORKDIR/min1.i .
    rcp vanguard:$PBS_O_WORKDIR/dyn.i .
    rcp vanguard:$PBS_O_WORKDIR/ige3.psf .
    rcp vanguard:$PBS_O_WORKDIR/ige3a.crd .

    # Run the job
    export runc34=/home/.2/charmm34b2/exec/gnu/charmm34.pgi64
    $runc34 < min1.i > min1.o
    $runc34 < dyn.i > dyn.o

    # copy the output files back to vanguard, where the submission took place
    rcp min1.o vanguard:$PBS_O_WORKDIR/.
    rcp min1.crd vanguard:$PBS_O_WORKDIR/.
    rcp dyn.o vanguard:$PBS_O_WORKDIR/.
    rcp dyn.crd vanguard:$PBS_O_WORKDIR/.

    This submits a job called dyn_test to the parallel queue. When the job runs it moves into the temp directory and copies files in from vanguard; charmm is then run and at the end the output files are copied back to vanguard. The temp directory is then removed automatically by the batch system.

    Parallel Charmm

    For 8 cpu parallel charmm you must submit the job to the parallel queue and the script must be altered as in the example below.

    # Name of job
    #PBS -N test_solv
    # names of std out and std err to be sent back to this directory; just use
    # the names without the preceding path
    #PBS -j eo
    #PBS -e errors
    # Queue to use
    #PBS -q parallel
    #Request 2 nodes with 4 processors per node i.e. total 8 cpu's
    #PBS -l nodes=2:ppn=4
    # This jobs working directory is set below

    echo Running on host `hostname`
    echo Time is `date`

    cd $WORKDIR
    echo Working directory is $WORKDIR

    #copy some file
    rcp vanguard:$PBS_O_WORKDIR/dimer.psf .
    rcp vanguard:$PBS_O_WORKDIR/test9.crd .
    rcp vanguard:$PBS_O_WORKDIR/test9.res .
    rcp vanguard:$PBS_O_WORKDIR/test10.i .
    rcp vanguard:$PBS_O_WORKDIR/top_all27_prot_na.rtf .
    rcp vanguard:$PBS_O_WORKDIR/par_all27_prot_na.prm .

    # Set up the parallel environment
    export PATH=/usr/local/mpich2-1.2.0_pgi902/bin:$PATH
    export charmm=/home/.2/charmm35b3/exec/gnu/charmm.pgi90.mpich1.20.qchem
    cat $PBS_NODEFILE | sort | uniq > mpd.hosts
    mpdcleanup -f mpd.hosts
    mpdboot --rsh=/usr/bin/rsh -n `wc -l < mpd.hosts`

    # Run the job
    mpiexec -n `wc -l < $PBS_NODEFILE` $charmm < test10.i > test10.o

    #release the nodes from the mpi system
    mpdallexit

    # copy the output files back to vanguard do not overwrite
    # the starting crd/restart file
    /bin/rm test9.crd test9.res
    rcp *.o vanguard:$PBS_O_WORKDIR/.
    rcp *.crd vanguard:$PBS_O_WORKDIR/.
    rcp *.traj vanguard:$PBS_O_WORKDIR/.
    rcp *.res vanguard:$PBS_O_WORKDIR/.

    Hopefully this should be all that is needed. The cluster is set up so that machines on the same switches will be utilised for parallel runs as far as possible.
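
The two counts used in the script above (distinct hosts for mpdboot, total lines for mpiexec) both come from $PBS_NODEFILE, which Torque fills with one line per allocated processor. The sketch below imitates that file for nodes=2:ppn=4 so the arithmetic can be seen off-cluster; the host names node01 and node02 are made up.

```shell
#!/bin/sh
# Stand-in for $PBS_NODEFILE: a request of nodes=2:ppn=4 yields one line
# per allocated processor. The host names below are made up.
nodefile=$(mktemp)
for host in node01 node02; do
    for cpu in 1 2 3 4; do
        echo "$host"
    done
done > "$nodefile"

# One line per distinct host: the number of mpd daemons to boot.
hosts=$(sort "$nodefile" | uniq | wc -l)
# One line per processor: the number of MPI processes to start.
procs=$(wc -l < "$nodefile")

# Left unquoted on purpose so any padding printed by wc is dropped.
echo "hosts:" $hosts "processes:" $procs
rm -f "$nodefile"
```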


    Example of a parallel Amber submission script.

    Parallel work using Amber9 can be performed, for one cpu (serial) work it should be easy to see what changes are needed.

    # Name of job
    #PBS -N wt_cole7_equil
    # combine std err and out and send them back to the submission dir
    #PBS -j eo
    #PBS -e equil.err
    # Queue to use
    #PBS -q parallel
    #PBS -l nodes=2:ppn=4
    # This jobs working directory is set below

    echo Running on host `hostname`
    echo Time is `date`

    cd $WORKDIR
    echo Working directory is $WORKDIR

    #copy some file
    rcp vanguard:$PBS_O_WORKDIR/wt1.res .
    rcp vanguard:$PBS_O_WORKDIR/wt.top .
    rcp vanguard:$PBS_O_WORKDIR/equil.in .


    # clean nodes, prepare the nodes file and link the nodes together
    cat $PBS_NODEFILE | sort | uniq > mpd.hosts
    mpdcleanup -f mpd.hosts

    mpdboot --rsh=/usr/bin/rsh -n `wc -l < mpd.hosts`


    # Run a parallel mpirun job
    mpiexec -n `wc -l < $PBS_NODEFILE` \
    $AMBERHOME/exe.pgi64/sander.MPI -O -i equil.in \
    -o equil.out -c wt1.res -r equil.res -x equil.traj \
    -inf wt1.inf -p wt.top -ref wt1.res

    #Clean up the nodes
    mpdallexit

    # copy the output files back to vanguard
    rcp equil.out vanguard:$PBS_O_WORKDIR/.
    rcp equil.res vanguard:$PBS_O_WORKDIR/.
    rcp equil.traj vanguard:$PBS_O_WORKDIR/.
    rcp wt1.inf vanguard:$PBS_O_WORKDIR/.

    Example Gamess submission script with parallel processing.

    Other types of job can also be submitted using similar scripts. A side-effect of the batch submission is that input files with $ in them can get scrambled when they are embedded in the submission script; a way around this is to write % instead and then use sed to convert the % to $. This is not a problem for input files that are copied from vanguard.
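
An alternative worth trying is to quote the here-document delimiter, which tells the shell to copy the text literally. The sketch below only demonstrates the difference; it assumes the submission script is interpreted by a Bourne-type shell such as sh or bash, and it has not been checked against the scrambling seen on this cluster.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Quoted delimiter: the text is copied literally, so $CONTRL and $END
# reach the file unchanged.
cat > "$tmp/quoted.inp" << 'EOF'
$CONTRL SCFTYP=RHF $END
EOF

# Unquoted delimiter: $CONTRL and $END are expanded as shell variables.
# They are unset here, so they vanish from the file.
unset CONTRL END
cat > "$tmp/unquoted.inp" << EOF
$CONTRL SCFTYP=RHF $END
EOF

quoted=$(cat "$tmp/quoted.inp")
unquoted=$(cat "$tmp/unquoted.inp")
echo "quoted:   [$quoted]"
echo "unquoted: [$unquoted]"
rm -rf "$tmp"
```

If the quoted form works for your job, the % and sed step becomes unnecessary.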

    A Gamess example:

    ### Name of job
    #PBS -N gamess_test
    # combine the std out and std err and send to a file called errors in the
    # submission directory
    #PBS -j eo
    #PBS -e errors
    ### Queue to use
    #PBS -q parallel
    #PBS -l nodes=1:ppn=1

    ### Gather some details about the host for reference later
    echo Running on host `hostname`
    echo Time is `date`

    # move into the temp directory for this job
    cd $WORKDIR
    echo Working directory is $WORKDIR

    # prepare the gamess input file
    # Here we have to use % instead of $ because tcsh doesn't
    # like it so we then use sed to convert it back
    cat > tmp.inp << EOF
    !
    %CONTRL SCFTYP=RHF RUNTYP=OPTIMIZE COORD=ZMT NZVAR=0 %END
    %SCF DIRSCF=.TRUE. %END
    %SYSTEM MWORD=6 %END
    %STATPT OPTTOL=1.0E-5 %END
    %BASIS GBASIS=N31 NGAUSS=6 NDFUNC=1 %END
    %DFT DFTTYP=SVWN %END
    %SCF DIRSCF=.TRUE. %END
    %GUESS GUESS=HUCKEL %END
    %DATA
    Methylene...1-A-1 state...RSVWN/6-31G*
    Cnv 2

    C
    H 1 rCH
    H 1 rCH 2 aHCH

    rCH=1.09
    aHCH=110.0
    %END

    EOF

    sed s/\%/\$/g tmp.inp > meth.inp

    # Run the job
    /home/.2/gamess_2009r3.linux/rungms meth > meth.log

    # copy the output files back to vanguard
    rcp meth.log vanguard:$PBS_O_WORKDIR/.


    For parallel gamess set the number of nodes you want using
    #PBS -l nodes=2:ppn=4
    #PBS -q parallel

    then replace the rungms line with something like
    /home/.2/gamess_2007/rungms.pbs meth 01 8 > meth.log

    where the first number is 01 and the second number is the nodes * ppn e.g. here 4*2=8

    and gamess should run in parallel.

    Example ACES2 submission script.

    An Aces2 example:

    ### Name of job
    #PBS -N aces2_test
    # combine the std out and std err and send to a file called errors in the
    # submission directory
    #PBS -j eo
    #PBS -e errors
    ### Queue to use
    #PBS -q parallel
    #PBS -l nodes=1:ppn=1

    ### Gather some details about the host for reference later
    echo Running on host `hostname`
    echo Time is `date`

    # move into the temp directory for this job
    cd $WORKDIR
    echo Working directory is $WORKDIR

    # prepare the aces2 input file
    # NOTE you must use ZMAT as the input file here
    cat > ZMAT << EOF
    Water CC-LR/DZP at experimental equilibrium geometry
    O
    H 1 R
    H 1 R 2 A

    R=0.958
    A=104.5

    *ACES2(CALC=CCSD,BASIS=DZP,EXCITE=EOMCC)

    %excite*
    1
    1
    1 5 0 6 0 1.0

    EOF

    # Set up the ACES 2 environment
    # For an AMD64 machine use source /home/.2/aces2/cshrc.amd64 instead
    source /home/.2/aces2/cshrc
    cp /home/.2/aces2/basis/GENBAS .
    # Run the job
    xaces2 > water.log

    # copy the output files back to vanguard
    rcp water.log vanguard:$PBS_O_WORKDIR/.

    Example DOCK submission script.

    ### Name of job
    #PBS -N dock_test
    # combine the std out and std err and send to a file called errors in the
    # submission directory
    #PBS -j eo
    #PBS -e errors
    ### Queue to use
    #PBS -q parallel
    #PBS -l nodes=1:ppn=1

    ### Gather some details about the host for reference later
    echo Running on host `hostname`
    echo Time is `date`

    # move into the temp directory for this job
    cd $WORKDIR
    echo Working directory is $WORKDIR

    # prepare the docking directory; rcp -r copies entire directories and their
    # contents
    rcp vanguard:$PBS_O_WORKDIR/INDOCK .
    rcp vanguard:$PBS_O_WORKDIR/split_database_index .
    rcp vanguard:$PBS_O_WORKDIR/dock52 .
    rcp vanguard:$PBS_O_WORKDIR/vdw.parms.amb.mindock .
    rcp -r vanguard:$PBS_O_WORKDIR/dist .
    rcp -r vanguard:$PBS_O_WORKDIR/grids .
    rcp -r vanguard:$PBS_O_WORKDIR/crds .

    # Run the job
    ./dock52

    # copy the output files back to vanguard
    rcp OUTDOCK vanguard:$PBS_O_WORKDIR/.
    rcp test.3 vanguard:$PBS_O_WORKDIR/.
    rcp test.eel3 vanguard:$PBS_O_WORKDIR/.


    Example ZDOCK submission script.

    This is for submitting a ZDock job to run on one node and two cpus. It uses mpich1 and 32 bits, so we need to use a different calling system for mpi.
    # Name of job
    #PBS -N a_mpi
    # names of std out and std err to be sent back to this directory; just use
    # the names without the preceding path
    #PBS -j eo
    #PBS -e a_mpi_err
    # Queue to use
    #PBS -q parallel
    # This jobs working directory is set below

    echo Running on host `hostname`
    echo Time is `date`

    cd $WORKDIR
    echo Working directory is $WORKDIR

    #set up the p4pg file
    echo "$HOSTNAME 0 /home/.2/zdock-2.3/zdock.mpi" > pgfile
    echo "$HOSTNAME 1 /home/.2/zdock-2.3/zdock.mpi" >> pgfile

    #copy some file
    rcp vanguard:$PBS_O_WORKDIR/xol_l_b.pdb .
    rcp vanguard:$PBS_O_WORKDIR/ige_r_a.pdb .

    # Run the job
    export P4_RSHCOMMAND=rsh
    time /usr/local/mpich-1.2.7p1/bin/mpirun -np 2 -p4pg pgfile \
    /home/.2/zdock-2.3/zdock.mpi \
    -R ige_r_a.pdb -L xol_l_b.pdb -o test_a.out

    # copy the output files back to vanguard
    rcp test_a.out vanguard:$PBS_O_WORKDIR/.

    To follow it up with an RDock job, a script like the one below can be used to run on one node with two cpus, with half the configs done on one cpu and half on the other.
    # Name of job
    #PBS -N ige_xol_rdock
    # names of std out and std err to be sent back to this directory; just use
    # the names without the preceding path
    #PBS -j eo
    #PBS -e errors
    # Queue to use
    #PBS -q parallel
    # This jobs working directory is set below

    echo Running on host `hostname`
    echo Time is `date`

    cd $WORKDIR
    echo Working directory is $WORKDIR

    #copy some file
    export ZDOCK_HOME=/home/.2/zdock-2.3
    rcp vanguard:$PBS_O_WORKDIR/ige_xol.out .
    rcp vanguard:$PBS_O_WORKDIR/xol_l_a.pdb .
    rcp vanguard:$PBS_O_WORKDIR/ige_r_b.pdb .
    rcp vanguard:$ZDOCK_HOME/rdock_jon.pl rdock.pl
    rcp vanguard:$ZDOCK_HOME/pdb2crd .
    rcp vanguard:$ZDOCK_HOME/deltaG .
    rcp vanguard:$ZDOCK_HOME/create_lig .
    rcp vanguard:$ZDOCK_HOME/BND.charmm .
    rcp vanguard:$ZDOCK_HOME/RTF.charmm .
    rcp vanguard:$ZDOCK_HOME/amino.rtf .
    rcp vanguard:$ZDOCK_HOME/param.prm .

    # Run the job, the wait allows for the situation where one background job
    # ends before the other - we will wait for _both_ to finish
    chmod u+x rdock.pl pdb2crd deltaG create_lig
    ./rdock.pl -d ./ -x 1 1000 -o out -i ige_xol_a1.out &
    ./rdock.pl -d ./ -x 1001 2000 -o out1 -i ige_xol_a2.out &
    wait

    # copy the output files back to vanguard
    rcp ige_xol_a1.out vanguard:$PBS_O_WORKDIR/.
    rcp ige_xol_a2.out vanguard:$PBS_O_WORKDIR/.
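
The background-and-wait pattern used above can be tried on its own. In this sketch the function half_job, the sleep times and the file names are illustrative stand-ins for the two rdock.pl runs.

```shell
#!/bin/sh
# Stand-in for one half of the work; the real script runs rdock.pl here.
# The function name and the sleep are illustrative only.
half_job() {
    sleep "$1"
    echo "finished $2" > "$2"
}

outdir=$(mktemp -d)

# Start both halves in the background so they run at the same time.
half_job 1 "$outdir/first.out" &
half_job 0 "$outdir/second.out" &

# wait with no arguments blocks until every background job has ended,
# even though the second one finishes before the first.
wait

# Only now is it safe to copy results back; here we just list them.
ls "$outdir"
```

Without the wait, the script would reach the rcp lines (and the end of the job) while the slower half was still running.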

    Example QChem submission script.

    This is for submitting a QChem job to run on 1 node and 4 cpus. QChem requires that all the nodes can write to the same disk space, which is something we don't normally have on the cluster. At the moment we are using vanguard's archive space as a common store. QChem is only licensed on some bigmem and long queue machines, and in order to ensure you are assigned a machine with a license you need to include a line like '#PBS -l nodes=1:qchem'.
    ### Name of job
    #PBS -N qchem_test
    # names of std out and std err to be sent back to this directory; just use
    # the names without the preceding path
    #PBS -j eo
    #PBS -e errors
    ### Queue to use
    #PBS -q quad
    #Tell PBS we want 1 machine with 4 cores
    #PBS -l nodes=1:ppn=4
    ###Get some useful info for debugging later on
    echo Running on host `hostname`
    echo Time is `date`

    # QCHEM
    export HOSTNAME=`hostname`
    export QC=/home/.2/qchem3102
    source $QC/bin/qchem.setup.sh
    QCSCRATCH=/cluster/$HOSTNAME/${PBS_O_LOGNAME}.${PBS_JOBID}
    QCLOCALSCR=$WORKDIR
    mkdir -p $QCSCRATCH
    cd $WORKDIR

    cp /home/.2/qchem3102/samples/DFT_glutamine.in .

    cat $PBS_NODEFILE | sort | uniq > mpd.hosts
    mpdcleanup -f mpd.hosts
    $QC/bin/mpi2/mpdboot --rsh=/usr/bin/rsh -n `wc -l < mpd.hosts`
    qchem -pbs -np `wc -l < $PBS_NODEFILE` DFT_glutamine.in > DFT_glutamine.out
    $QC/bin/mpi2/mpdallexit

    rcp DFT_glutamine.out vanguard:$PBS_O_WORKDIR/.
    /bin/rm -rf /cluster/$HOSTNAME/${PBS_O_LOGNAME}.${PBS_JOBID}
    /bin/rm -rf /cluster/$HOSTNAME/${PBS_O_LOGNAME}.${PBS_JOBID}.*

    Note: if you use the dellr610 queue you can use the dedicated GFS2 iSCSI array for the common space. To do this, change the QCSCRATCH line to the following instead:
    QCSCRATCH=/clussan/iscsi1/${PBS_O_LOGNAME}.${PBS_JOBID}
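
If the same script is used on several queues, the choice of scratch location can be made in one place. This is a sketch under assumptions: qc_scratch_dir is a hypothetical helper, the queue name is passed in by hand, and the sample values stand in for what `hostname`, $PBS_O_LOGNAME and $PBS_JOBID would give inside a job.

```shell
#!/bin/sh
# Hypothetical helper: choose the QChem scratch location from the queue
# name. Only dellr610 uses the iSCSI array; other queues use /cluster.
qc_scratch_dir() {
    queue=$1 host=$2 user=$3 jobid=$4
    case "$queue" in
        dellr610) echo "/clussan/iscsi1/$user.$jobid" ;;
        *)        echo "/cluster/$host/$user.$jobid" ;;
    esac
}

# Made-up values; in a job these would be `hostname`, $PBS_O_LOGNAME
# and $PBS_JOBID.
QCSCRATCH=$(qc_scratch_dir dellr610 node01 jon 28.vanguard)
echo "$QCSCRATCH"
```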

    Example QChem/Charmm QM/MM submission script.

    #PBS -V
    #PBS -j oe
    #PBS -o errors
    #PBS -N cqtest2
    #PBS -q quad@vanguard
    #PBS -l nodes=1:ppn=4

    echo "Time is `date`"
    echo "Running on host(s) `cat $PBS_NODEFILE`"
    echo "with likely master node `hostname`"
    cd $WORKDIR
    cat $PBS_NODEFILE | sort | uniq > mpd.hosts

    # QCHEM
    export HOSTNAME=`hostname`
    export QC=/home/.2/qchem3102
    source $QC/bin/qchem.setup.sh
    QCSCRATCH=/cluster/$HOSTNAME/$PBS_O_LOGNAME.$PBS_JOBID.global
    QCLOCALSCR=$WORKDIR
    mkdir -p $QCSCRATCH
    export PATH=$QC/bin/mpi2/:$PATH

    # CHARMM+QCHEM
    QCHEMINP=cq.inp
    QCHEMEXE=qchem\ -pbs
    QCHEMCNT=qchem.inp
    QCHEMOUT=w2.qcout
    CHARMMEXE=/home/.2/charmm35b3/exec/gnu/charmm.pgi90.mpich1.20.qchem
    export QCHEMINP QCHEMEXE QCHEMCNT QCHEMOUT QCSCRATCH QCLOCALSCR CHARMMEXE PATH

    rcp vanguard:$PBS_O_WORKDIR/$QCHEMCNT .
    rcp vanguard:$PBS_O_WORKDIR/w2.inp .

    # each charmm instance also starts "nproc" number of qchem instances
    # hence the manual 2 below (and a PARA 2 in input file)
    # gives 2 charmm x 2 qchem = 4 (quad)
    cat $PBS_NODEFILE | sort | uniq > mpd.hosts
    mpdcleanup -f mpd.hosts
    $QC/bin/mpi2/mpdboot --rsh=/usr/bin/rsh -n `wc -l < mpd.hosts`
    $QC/bin/mpi2/mpiexec -n 2 $CHARMMEXE < w2.inp > w2.out
    $QC/bin/mpi2/mpdallexit

    # mpiexec -n `wc -l < $PBS_NODEFILE` $CHARMMEXE < w2.inp > w2.out

    rcp w2.out vanguard:$PBS_O_WORKDIR/.
    rcp test.coor vanguard:$PBS_O_WORKDIR/.
    rcp w2.qcout vanguard:$PBS_O_WORKDIR/.
    /bin/rm -rf /cluster/$HOSTNAME/${PBS_O_LOGNAME}.${PBS_JOBID}
    /bin/rm -rf /cluster/$HOSTNAME/${PBS_O_LOGNAME}.${PBS_JOBID}.*
    Note: if you use a STREAM file as the initial input to Charmm, which then streams the actual charmm script, you must stream the actual charmm script from a unit number other than 99 (the default), because qchem also uses unit 99 and this causes charmm to stop reading the streamed file after qchem has run. Something like:
    * MPI does not allow rewind on stdin, thus loops in CHARMM will fail.
    * Streaming the whole inputfile is a workaround (see e.g. parallel.doc).
    * The default unit is 99 but this causes problems with qchem so here we use 77 instead.
    *
    open unit 77 read form name w2.inp
    STREam unit 77
    STOP
    should work in this case.

    Example of submission script for a perl job.

    Due to the problems with the $ character in the submission scripts it is recommended that perl scripts are not embedded into the submission scripts; instead the perl script should be prepared beforehand and copied to the cluster using rcp, as in the Charmm example above.

    ### Name of job
    #PBS -N perl_test
    # combine the std out and std err and send to a file called errors in the
    # submission directory
    #PBS -j eo
    #PBS -e errors
    ### Queue to use
    #PBS -q parallel
    #PBS -l nodes=1:ppn=1

    ### Gather some details about the host for reference later
    echo Running on host `hostname`
    echo Time is `date`

    # move into the temp directory for this job
    cd $WORKDIR
    echo Working directory is $WORKDIR

    # prepare the working directory
    # remember to make the perl scripts executable using chmod
    rcp vanguard:$PBS_O_WORKDIR/all_mut.pl .
    rcp vanguard:$PBS_O_WORKDIR/extract.pl .
    rcp vanguard:$PBS_O_WORKDIR/template.pdb .
    chmod u+x all_mut.pl extract.pl

    # run the perl script
    ./all_mut.pl
    # gather the energies from the output files
    ./extract.pl > final_energies

    # copy the wanted files back to vanguard
    rcp final_energies vanguard:$PBS_O_WORKDIR/.


    Example of submission script for an nwchem job.

    Initial job setup can be performed using the ecce_builder program, input and output files can be converted to and from pdb formats using the babel program, or opened directly with the jmol program.
    This will run nwchem on 1 node with 4 cpus per node, i.e. 4 cpus in total.
    # Name of job
    #PBS -N nw_test
    # names of combined std out and err
    #PBS -j eo
    #PBS -e nw_run1.err
    # Queue to use
    #PBS -q parallel
    #PBS -l nodes=1:ppn=4
    # This jobs working directory is set below
    echo Running on host `hostname`
    echo Time is `date`

    # get a unique directory name and move into it
    cd $WORKDIR
    echo Working directory is $WORKDIR

    # prepare the nwchem input file
    # this next line means copy everything below until we see
    # EOF into a file called nw_blyp.nw
    cat > nw_blyp.nw << EOF
    start dft_test

    charge -1
    memory 1500 mb

    geometry units angstroms
    P 0.000000 0.000000 0.000000
    O 0.000000 0.000000 1.498500
    O 1.335749 0.000000 -0.959419
    O -1.429242 0.228301 -0.793491
    O 0.246763 1.801356 -0.145602
    O -0.147288 -1.642459 -0.263059
    C -2.148059 1.476757 -0.919444
    C -0.590939 2.681214 0.605684
    C -0.240092 -2.267327 -1.548966
    C -2.070013 2.274822 0.384177
    H 1.681536 0.894983 -1.060784
    H -1.730066 2.046658 -1.740307
    H -3.165862 1.194862 -1.157882
    H -0.413637 3.685782 0.230216
    H -0.337654 2.638306 1.657363
    H -0.359894 -3.329649 -1.370195
    H -1.098029 -1.891222 -2.096032
    H 0.660549 -2.094186 -2.126827
    H -2.715661 3.148885 0.328365
    H -2.410151 1.643125 1.195446
    end

    print low

    basis "cd basis"
    C library "Ahlrichs Coulomb Fitting"
    P library "Ahlrichs Coulomb Fitting"
    H library "Ahlrichs Coulomb Fitting"
    O library "Ahlrichs Coulomb Fitting"
    end

    basis "ao basis"
    C library 6-31+g*
    P library 6-31+g*
    H library 6-31g
    O library 6-31+g*
    end

    dft
    XC becke88 lyp
    grid ssf euler lebedev 75 11
    end

    scf
    maxiter 30
    end

    driver
    maxiter 30
    end

    task dft optimize
    EOF
    # The EOF above marks the end of the nwchem input file

    #Prepare the nodes file for parallel nwchem runs
    mpdallexit
    cat $PBS_NODEFILE | sort | uniq > mpd.hosts
    mpdboot --rsh=/usr/bin/rsh -n `wc -l < mpd.hosts`
    export nwchem=/home/.2/nwchem5.1/bin/nwchem.mpich2

    mpiexec -n `wc -l < $PBS_NODEFILE` $nwchem nw_blyp > nw_blyp.out
    mpdallexit

    # copy the output and restart required files back to vanguard
    rcp nw_blyp.out vanguard:$PBS_O_WORKDIR/nw_run1.out
    rcp dft_test.db vanguard:$PBS_O_WORKDIR/dft_test.db
    rcp dft_test.movecs vanguard:$PBS_O_WORKDIR/dft_test.movecs

    For a zmatrix example (CH3CF3):
    geometry
    zmatrix
    C
    C 1 CC
    H 1 CH1 2 HCH1
    H 1 CH2 2 HCH2 3 TOR1
    H 1 CH3 2 HCH3 3 -TOR2
    F 2 CF1 1 CCF1 3 TOR3
    F 2 CF2 1 CCF2 6 FCH1
    F 2 CF3 1 CCF3 6 -FCH1
    variables
    CC 1.4888
    CH1 1.0790
    CH2 1.0789
    CH3 1.0789
    CF1 1.3667
    CF2 1.3669
    CF3 1.3669
    constants
    HCH1 104.28
    HCH2 104.74
    HCH3 104.7
    CCF1 112.0713
    CCF2 112.0341
    CCF3 112.0340
    TOR1 109.3996
    TOR2 109.3997
    TOR3 180.0000
    FCH1 106.7846
    end
    end


    Command useful to the queuing system.

    qsub

    Submit a job to the batch queue system

    qstat

    Get the status of the batch queue system

    qdel

    Delete a job from a queue

    pbsdsh

    Distribute a task to the nodes of a PBS job

    qalter

    Changes the characteristics of a PBS job that is waiting to run

    qmgr

    Displays, adds, changes, or deletes PBS server, queue and node configuration information. General users can only display information about the PBS configuration

    qmove

    Moves PBS jobs between queues.

    qmsg

    Sends a message to a PBS job

    qorder

    Exchange the order of two PBS jobs in a queue.

    qrerun

    Rerun a PBS job

    qselect

    List PBS job identifiers for jobs meeting selection criteria

    qsig

    Send a signal to a PBS job

    xpbs

    An X-Windows interface for using PBS and monitoring PBS jobs

    xpbsmon

    An X-Windows interface for monitoring PBS batch nodes.

    checkjob

    Get detailed information about a specific job with respect to the queue and scheduling policy.
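
As a rough illustration of how a few of these commands fit together, the sketch below defines a small helper, job_state (a name made up here), that picks the state letter of one job out of default-format qstat output. It assumes the state is the second-to-last column, as in the listings on this page, and the demo runs on pasted text so that no PBS server is needed.

```shell
#!/bin/sh
# job_state JOBID: read default-format qstat output on stdin and print the
# one-letter state (Q, R, E, ...) of JOBID, or nothing if it is not listed.
job_state() {
    awk -v id="$1" '$1 == id { print $(NF-1) }'
}

# Demo on text copied from a qstat listing
sample='Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
557.vanguard gau_test jon 0 Q express'
printf '%s\n' "$sample" | job_state 557.vanguard

# On the cluster the helper could be fed live output, for example:
#   jobid=$(qsub g03_run.in)       # qsub prints the new job id
#   qstat | job_state "$jobid"     # Q while waiting, R while running
#   qdel "$jobid"                  # only if the job should be removed
```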


    An example of cluster fairsharing on an express queue

    First we check which queues are available:
    [jon@vanguard /tmp]-% qstat -q

    server: vanguard.carmay.office


    Queue            Memory CPU Time Walltime Node Run Que Lm State
    ---------------- ------ -------- -------- ---- --- --- -- -----
    parallel         --     --       168:00:0 --   14  0   -- E R
    bigmem           --     --       504:00:0 --   6   0   -- E R
    medium64         --     --       336:00:0 --   23  0   -- E R
    long             --     --       504:00:0 --   16  1   -- E R

    We then submit a job to the express queue and check it a few times to see it first queued and then running. It usually takes about 10 seconds to go from Q to R if cpus are available.
    [jon@vanguard /tmp]-% qsub g03_run.in
    557.vanguard

    [jon@vanguard /tmp]-% qstat
    Job id Name User Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    557.vanguard gau_test jon 0 Q express

    [jon@vanguard /tmp]-% qstat
    Job id Name User Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    557.vanguard gau_test jon 0 R express
    Now we submit 7 jobs to see the fairshare working. At any one time on this queue we should see at most 3 jobs running for the same user. Since the jobs are identical, we expect 3 to run, then 3 more once those stop, and finally the seventh job on its own. If the jobs were of different lengths they would not go in batches of three. In fact we were lucky here to catch the 7th job with state E, which means exiting; in our case it is copying the output files back to the file servers.
    [jon@vanguard /tmp]-% qstat
    Job id Name User Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    558.vanguard gau_test jon 0 Q express
    559.vanguard gau_test jon 0 Q express
    560.vanguard gau_test jon 0 Q express
    561.vanguard gau_test jon 0 Q express
    562.vanguard gau_test jon 0 Q express
    563.vanguard gau_test jon 0 Q express
    564.vanguard gau_test jon 0 Q express

    [jon@vanguard /tmp]-% qstat
    Job id Name User Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    558.vanguard gau_test jon 0 R express
    559.vanguard gau_test jon 0 R express
    560.vanguard gau_test jon 0 R express
    561.vanguard gau_test jon 0 Q express
    562.vanguard gau_test jon 0 Q express
    563.vanguard gau_test jon 0 Q express
    564.vanguard gau_test jon 0 Q express

    [jon@vanguard /tmp]-% qstat -a

    vanguard:
                                                                Req'd  Req'd   Elap
    Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
    --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
    561.vanguard    jon      express  gau_test   --     1   --  --     03:00 Q --
    562.vanguard    jon      express  gau_test   --     1   --  --     03:00 Q --
    563.vanguard    jon      express  gau_test   --     1   --  --     03:00 Q --
    564.vanguard    jon      express  gau_test   --     1   --  --     03:00 Q --

    [jon@vanguard /tmp]-% qstat
    Job id Name User Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    564.vanguard gau_test jon 0 Q express

    [jon@vanguard /tmp]-% qstat
    Job id Name User Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    564.vanguard gau_test jon 0 E express
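
The seven submissions in the example above could be scripted with a small loop. This is only a sketch: submit_n and the QSUB override variable are names invented here so that the loop can be tried as a dry run without a PBS server.

```shell
#!/bin/sh
# submit_n COUNT SCRIPT: submit SCRIPT COUNT times, printing what each
# submission returns (qsub prints the new job id).
# Setting QSUB (e.g. QSUB=echo) replaces qsub for a dry run.
submit_n() {
    i=0
    while [ "$i" -lt "$1" ]; do
        "${QSUB:-qsub}" "$2" || return 1
        i=$((i + 1))
    done
}

# Dry run: prints the script name 7 times instead of submitting anything
QSUB=echo submit_n 7 g03_run.in
```

On the cluster, running submit_n 7 g03_run.in without QSUB set would call the real qsub seven times.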
    One last example for now: user jon1 submits 4 jobs to the express queue and user jon submits one. Here we see that the 4th jon1 job is held in the queue and that the job from user jon runs in preference, to make the usage fairer.
    [jon@vanguard /tmp]-% qstat
    Job id Name User Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    582.vanguard gau_test jon1 00:02:24 R express
    583.vanguard gau_test jon1 00:00:16 R express
    584.vanguard gau_test jon1 00:00:40 R express
    585.vanguard gau_test jon1 0 Q express
    586.vanguard gau_test jon 00:00:20 R express
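
To see at a glance how running and queued jobs are split between users, the qstat listing can be summarised with awk. The helper below, user_state_counts (a name made up for this sketch), assumes the default qstat layout shown above: two header lines, the user in the third column and the state in the second-to-last column.

```shell
#!/bin/sh
# user_state_counts: read default-format qstat output on stdin and print
# one "user state count" line for every user/state pair.
user_state_counts() {
    awk 'NR > 2 && NF >= 6 { n[$3 " " $(NF-1)]++ }
         END { for (k in n) print k, n[k] }' | LC_ALL=C sort
}

# Demo on the listing from the last example
sample='Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
582.vanguard gau_test jon1 00:02:24 R express
583.vanguard gau_test jon1 00:00:16 R express
584.vanguard gau_test jon1 00:00:40 R express
585.vanguard gau_test jon1 0 Q express
586.vanguard gau_test jon 00:00:20 R express'
printf '%s\n' "$sample" | user_state_counts
```

For the listing above this reports three running and one queued job for jon1 and one running job for jon. On the cluster the live equivalent would be qstat | user_state_counts.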

    Parallel Charmm on the PC-Farm

    This is a little tricky to set up the first time. Make a directory called bin on the pc-farm and copy the charmm27.mpi binary there. Also copy the libpgc.a library there (you'll find it on vanguard under /usr/pgi/linux86/lib). Edit your .bashrc file and add the line "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr1/{USERNAME}/bin". To actually run a job, use the mpisubmit command with something like "mpisubmit -n 8 -i charmm_input bin/charmm27.mpi". The maximum number of cpus you can get is currently 8. The output will be returned as a file called something like "charmm27.mpi.o1234.987654".
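
The .bashrc edit can be done by hand, but a small helper avoids adding the same line twice. The sketch below is only a suggestion: add_lib_path is a name invented here, and the demo works on a temporary file rather than a real .bashrc.

```shell
#!/bin/sh
# add_lib_path RCFILE DIR: append an LD_LIBRARY_PATH export for DIR to
# RCFILE unless exactly that line is already present.
add_lib_path() {
    alp_line="export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:$2"
    grep -qxF "$alp_line" "$1" 2>/dev/null || printf '%s\n' "$alp_line" >> "$1"
}

# Demo on a temporary file; calling it twice still leaves a single line
demo_rc=$(mktemp)
add_lib_path "$demo_rc" /usr1/someuser/bin
add_lib_path "$demo_rc" /usr1/someuser/bin
cat "$demo_rc"
rm -f "$demo_rc"

# On the pc-farm, with your own user name in place of {USERNAME}:
#   add_lib_path ~/.bashrc /usr1/{USERNAME}/bin
#   mpisubmit -n 8 -i charmm_input bin/charmm27.mpi
```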

    Last update: Sat Oct 1 10:18:38 CST 2011 Comments to: jon _at_ sinica.edu.tw
    These pages were created using vim -a very much vi-improved.

    Opinions on these pages are generally not Academia Sinica's.