Midlands e-Science Centre, University of Birmingham, DTI e-Science Grid

Torque: Multiple jobs

The qsub'd script file is unique in that its contents are copied as part of the job, and there is no harm in modifying it or deleting it before the job runs. However, files that the job might use are not treated that way: Torque does not scan your script to see what files it might use and whether it might be useful to take a copy of them! And it certainly can't tell which files a compiled program might use, since their names may be embedded in the program's source.

So it's up to you to make sure that the files that a job needs exist and have the right contents for that job at the time the job runs. This is of course normally achieved by creating them before the qsub and not touching them until that job has finished!

If you are going to submit and/or run multiple similar (but slightly different) jobs at the same time, then you have to be careful how they are submitted. For example, if the jobs internally use a data input file whose contents are to be different for each job, then you have to find a sensible way of achieving that: if you simply modified the data input file, submitted a job, and then repeated that exercise, all the jobs would probably run with the same (final) version of the data, because a queued job reads the file when it eventually runs, not when it is submitted!

There are various techniques that people use for this issue - when this is done many times it's usually worth writing a script which does the repetitive work for you (minimal sketches of these approaches are given below the list):
  • If the datafile is short enough, and contains text and not binary data, then embed the data in the job script itself as a here-document (see man bash, or Google). The job script might be created dynamically for multiple jobs.
  • Create a newly named datafile for each job, and customise the job script so that it contains that datafile name, and then qsub that script. The script might be a dynamically created copy for multiple jobs.
  • As before, create a newly named datafile for each job, assign its name to an exported variable and pass that variable through to the job script with the qsub -v option. The job script can then be the same for every job.
  • As before, create a newly named datafile for each job, and create a short wrapper job script which just invokes the original job script with the datafile name as its argument, and then qsub the wrapper job script. The original job script uses $1 as the name of the data file to process, and can be the same for every job. The wrapper job script would be the place to specify any qsub options which weren't on the qsub command line.
  • Create all the datafiles, with unique names, and for each datafile use its name (or the variable part of it) as the name of the job, passed through using the qsub -N option. The job script can then pick up which data file to use from the $PBS_JOBNAME environment variable. The standard output/error file names will be based on that name too. If you follow that convention for output data files as well, then you will have no problem with name clashes even if all files are stored within the same directory.
  • Create a new directory for the job, with a unique name, and copy any files which need to be customised per job to it. Do the customisation. Then do the qsub with that directory as the current directory, so it will be passed through as $PBS_O_WORKDIR in the usual way. Output files could be written to this directory too, and you wouldn't have to worry about filenames being the same because they're in a unique directory.
  • Submit the same job script any required number of times. This job script has to be cleverer than the one-off case, as it needs to customise a data file at run-time, using criteria possibly from some steering file. You still have to use a unique name for the data file (unless you put it in /tmp and are using all the processors on each node) and you still have to be careful about naming output files. If you are dynamically reading and updating a steering file in order to decide what this particular job should do, you need to use file-locking around the code which reads and updates that file, because other jobs will be accessing it too. See the simple example in man lockfile.
  • Similar to the last one, but here create all the possible datafiles with unique names first, in a chosen directory. Then qsub the same number of jobs as there are datafiles. Each job chooses the next-available datafile from that directory, with suitable locking techniques surrounding the code which makes that choice.
Any further suggestions welcome!
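
Here are some minimal sketches of the approaches above. Every filename, program name (myprogram) and resource request in them is invented purely for illustration, so adapt them to your own setup. First, the here-document approach, where the data travels inside the generated job script:

#!/bin/bash
# Generate and submit one job script per parameter value.  The data is
# embedded in each generated script as a here-document, so later edits
# to external files cannot affect jobs that are already queued.
for n in 1 2 3; do
  cat > job$n.sh <<EOF
#PBS -l walltime=00:10:00
cd \$PBS_O_WORKDIR
# re-create this job's own input file at run time from the embedded data
cat > input$n.dat <<INNER
temperature = $n
INNER
./myprogram input$n.dat > output$n.dat
EOF
  qsub job$n.sh
done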
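
For the customised-copy approach, one common trick (an assumed convention, nothing Torque-specific) is to keep a template job script containing a placeholder such as @DATAFILE@ and substitute the real name with sed:

# template.sh is an ordinary job script containing a line like:
#   INPUT=@DATAFILE@
for n in 1 2 3; do
  datafile=run$n.dat
  # ... create and populate $datafile for this particular job ...
  sed "s/@DATAFILE@/$datafile/" template.sh > job$n.sh
  qsub job$n.sh
done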
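
For the -v approach the submitted script never changes; only the environment handed to it does:

# jobscript.sh, identical for every job, contains something like:
#   cd $PBS_O_WORKDIR
#   ./myprogram "$DATAFILE"
for n in 1 2 3; do
  export DATAFILE=run$n.dat
  # -v copies the named variable from the submission environment into the
  # job's environment (qsub -v DATAFILE=run$n.dat is an equivalent one-step form)
  qsub -v DATAFILE jobscript.sh
done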
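
For the wrapper approach (useful because qsub itself does not pass command-line arguments on to the script), with process.sh standing in for the real job script:

# process.sh, identical for every job, uses $1 as the datafile name:
#   ./myprogram "$1" > "$1.out"
for n in 1 2 3; do
  cat > wrapper$n.sh <<EOF
#PBS -l walltime=00:10:00
cd \$PBS_O_WORKDIR
./process.sh run$n.dat
EOF
  qsub wrapper$n.sh
done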
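
For the job-name approach, with the (assumed) convention that a job named X reads X.dat and writes X.out:

# jobscript.sh, identical for every job:
#   cd $PBS_O_WORKDIR
#   ./myprogram "$PBS_JOBNAME.dat" > "$PBS_JOBNAME.out"

# submission: one job per prepared datafile run1.dat, run2.dat, ...
for f in run*.dat; do
  qsub -N "${f%.dat}" jobscript.sh
done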
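
For the per-job directory approach:

for n in 1 2 3; do
  mkdir run$n
  cp master/input.dat run$n/
  # ... customise run$n/input.dat for this particular job ...
  # submit from inside the new directory so it becomes $PBS_O_WORKDIR
  ( cd run$n && qsub ../jobscript.sh )
done
# jobscript.sh, identical for every job, simply starts with
#   cd $PBS_O_WORKDIR
# and reads and writes its files in that unique directory.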
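
For the steering-file variant, the locking can be done with the lockfile command; the one-task-per-line layout of steering.txt is an assumed convention, and sed -i assumes GNU sed:

# fragment of the (identical) job script: claim the next pending task
cd $PBS_O_WORKDIR
lockfile steering.lock             # waits until we own the lock
task=$(head -1 steering.txt)       # read the next pending task...
sed -i 1d steering.txt             # ...and remove it from the file
rm -f steering.lock                # release the lock
[ -n "$task" ] || exit 0           # nothing left to do
# build this job's own uniquely named input file and run the program
echo "temperature = $task" > input.$PBS_JOBID.dat
./myprogram input.$PBS_JOBID.dat > output.$PBS_JOBID.dat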
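
And for the last variant, where each job claims the next available datafile; the pending/ and claimed/ directories are invented, and the mv done under the lock is what stops two jobs grabbing the same file:

# fragment of the (identical) job script
cd $PBS_O_WORKDIR
lockfile choose.lock
f=$(ls pending | head -1)          # next available datafile, if any
[ -n "$f" ] && mv pending/"$f" claimed/"$f"
rm -f choose.lock
[ -n "$f" ] || exit 0              # all datafiles already taken
./myprogram claimed/"$f" > "$f".out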



L.S.Lowe