Parallel tasks without MPI

Some tasks can usefully be performed in parallel without the need for MPI, and for these the pbsdsh command is useful. Here is an example of an 8-processor job using pbsdsh:

#!/bin/sh
#PBS -l nodes=4:ppn=2
#PBS -l walltime=5:00:00,cput=20:00:00
#PBS -j oe
.... initial processing ....
pbsdsh -v $PBS_O_WORKDIR/myscript      # run myscript once on each of the 8 allocated cores
.... final processing ....

Since the same "myscript" is run on each of the processor cores of a job, that script needs to be clever enough to decide what its role is. Of course, if the task is identical on every processor core, then that's simple. But in the case where each processor core should be doing a different task, you can make use of an environment variable called $PBS_VNODENUM. This variable takes a value from 0 to c-1, where c is the number of processor cores allocated to the job, and is set by the Torque system when it invokes the pbsdsh'd script on each core. So if you have pre-prepared several lower-level scripts named mysub.0 to mysub.7, your file "myscript" might contain:

#!/bin/sh
cd $PBS_O_WORKDIR         # move from $HOME to the directory the job was submitted from
PATH=$PBS_O_PATH          # restore the PATH that was in effect at submission time
sh mysub.$PBS_VNODENUM    # run the pre-prepared sub-script matching this core's number

or, if you have pre-prepared a program myprog and a set of different data-files, mydata.0 to mydata.7, for the tasks, then

#!/bin/sh
cd $PBS_O_WORKDIR
PATH=$PBS_O_PATH
myprog < mydata.$PBS_VNODENUM
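
As a variation on the above, the per-core logic can also live inside "myscript" itself, dispatching on $PBS_VNODENUM with a case statement. Here is a minimal sketch; the name mymaster is just an illustrative placeholder:

#!/bin/sh
cd $PBS_O_WORKDIR
PATH=$PBS_O_PATH
case $PBS_VNODENUM in
  0) sh mymaster ;;                      # core 0 runs a special task
  *) myprog < mydata.$PBS_VNODENUM ;;    # every other core processes its own data-file
esac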

Let me know of other innovative methods of using pbsdsh.

Note that there is also the variable $PBS_NODENUM, which takes a unique value from 0 upwards for each different node (so 0 to 3 in the above example), but this is not as useful in the above context as $PBS_VNODENUM. There is also the variable $PBS_TASKNUM, which is incremented before each task on each core is started.
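
If you want to see how these three numbers are assigned on your own cluster, a simple check (just a sketch; the output file name where.N is only illustrative) is to have each task report them from within the job script:

pbsdsh -v sh -c 'echo $(hostname) node=$PBS_NODENUM vnode=$PBS_VNODENUM task=$PBS_TASKNUM > $PBS_O_WORKDIR/where.$PBS_VNODENUM'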

Initial environment of a script invoked by pbsdsh

A script invoked by pbsdsh starts in a very basic environment: the user's $HOME directory is defined and is the current directory, the LANG variable is set to C, and the PATH is set to the basic /usr/local/bin:/usr/bin:/bin as defined in a system-wide file pbs_environment. Nothing that would normally be set up by a system shell profile or user shell profile is defined, unlike the environment for the main job script. To be positive about this, you could say that this is very efficient, particularly if you use pbsdsh repeatedly in your main job script, as it eliminates unnecessary overheads!

The first thing such a script is likely to need to do, therefore, is to change directory to $PBS_O_WORKDIR, and to set the PATH to $PBS_O_PATH. Be careful: this approach assumes that the environment in which you submit the job is the one you want when it is running. Alternatively, it might be sensible for the script to source a file containing all the environment definitions that your job script requires.
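
As a sketch of that alternative, assuming you have saved the settings you need in a file in your work directory (the name myjobenv.sh is arbitrary), the invoked script could begin:

#!/bin/sh
cd $PBS_O_WORKDIR       # move from $HOME to the directory the job was submitted from
. ./myjobenv.sh         # source our own pre-prepared environment settings
.... per-core work ....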

Yet another choice is for the pbsdsh command in your main job script to invoke your script via a shell, such as sh or bash, with or without the "-l" login-shell option, so that each instance gets an initialised environment: for example:

    pbsdsh bash -l -c  '$PBS_O_WORKDIR/myscript'

In detail, the initial environment of a command invoked by pbsdsh has the following defined, listed alphabetically. Notice that this list of variable names is the same list as for a main job script (see the Torque details page), except that PBS_NODEFILE is not defined on secondary nodes.

ENVIRONMENT
HOME
LANG
PATH
PBS_ENVIRONMENT
PBS_JOBCOOKIE
PBS_JOBID
PBS_JOBNAME
PBS_MOMPORT
PBS_NODENUM
PBS_O_HOME
PBS_O_HOST
PBS_O_LANG
PBS_O_LOGNAME
PBS_O_MAIL
PBS_O_PATH
PBS_O_QUEUE
PBS_O_SHELL
PBS_O_WORKDIR
PBS_QUEUE
PBS_TASKNUM
PBS_VNODENUM
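
To verify this list on your own installation, one simple approach (a sketch; the file name env.N is just an example) is to capture each task's environment from within a job script:

pbsdsh sh -c 'env | sort > $PBS_O_WORKDIR/env.$PBS_VNODENUM'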

Questions of efficiency when running multi-core jobs

When considering running different processes on different nodes/cores as part of a multi-core job, be aware that some processes may finish well before others. Therefore the cores that those processes were using will be idle until all the pbsdsh-invoked processes have finished. Your job effectively reserves all the cores you requested for the total duration of the job: busy or not.

Some inefficiency is inevitable in this sort of parallel environment if the parts running in parallel are not identical, and this can make the cluster as a whole inefficient. Your user and group fair-shares are based on core wall-time occupancy, not on actual processing, so idle cores are still charged in fair-share terms, and will count against you and your group for future jobs. So do not devise jobs to work in this parallel way if there is little benefit in doing so and they can perfectly adequately run as multiple single-core jobs.


L.S.Lowe