Job Requeue
How to split a never-ending job (e.g. a RevBayes run) into smaller batches to work around partition limits and make better use of the system.
Intro
As we have mentioned in this guide, we use the SLURM scheduler to manage resources. We have partitions with different priority values to help balance resource requests. Your job is directed to one partition or another depending on the time limit and the type of node you ask for. These partitions also imply other limitations, such as the number of jobs a user can have running or queued at the same time, or the total number of processors. You can see some of these limits with the command batchlim. As you can see below, these differences are especially noticeable for jobs sent to the ondemand partition.
Name          Priority  GrpTRES   MaxTRES   MaxWall      MaxJobsPU  MaxTRESPU  MaxSubmit
------------  --------  --------  --------  -----------  ---------  ---------  ---------
short         50                  cpu=2048               50         cpu=2048   100
medium        40                  cpu=2048               30         cpu=2048   50
long          30        cpu=8576  cpu=2048               5          cpu=2048   10
requeue       20                  cpu=2048               5          cpu=2048   10
ondemand      10        cpu=4288  cpu=1024               2          cpu=1024   10
...
clk_short     50                  node=1    06:00:00     200        cpu=960    400
clk_medium    40                  node=1    3-00:00:00   200        cpu=960    250
clk_long      30        cpu=1440  node=1    7-00:00:00   60         cpu=360    60
clk_ondemand  10        cpu=720   node=1    42-00:00:00  20         cpu=240    20
...
This method makes use of system signaling, SLURM's requeue capability, and application checkpointing to work around these limitations while preserving the balance of the system.
Summary and Generalization
To break your long job into smaller batches you need three main tools: catch the SIGTERM signal with the trap command in your submit script, requeue the job when it receives that signal, ask the systems department to grant requeue capabilities to your job (this step is mandatory and cannot be done by the user), and finally manage the application's checkpointing procedure.
In this template, the variable max_restarts is defined to make sure your job does not run forever without supervision, but you can also define other criteria to decide whether to requeue, such as meeting a convergence criterion.
The next script is a general template that you can adapt to get your job running quickly. Adapt what you need and add your application lines underneath. If you want to see the process in a real-world example, go to the [Use case] section.
#!/bin/bash
##############################################################
## SBATCH directives, modify as needed ##
#SBATCH -t 24:00:00
#SBATCH -C clk #Node type. Leave empty for ilk
#SBATCH -J Rolling
#SBATCH -o /PATH/TO/STDOUT.out
#SBATCH -e /PATH/TO/STDOUT.err
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -c 10
#SBATCH --mem-per-cpu=3G
#SBATCH --signal=B:SIGTERM@30
##############################################################
## Gather some information from the job and set limits     ##
max_restarts=4 # tweak this number to fit your needs
scontext=$(scontrol show job ${SLURM_JOB_ID})
restarts=$(echo ${scontext} | grep -o 'Restarts=[0-9]*' | cut -d= -f2)
outfile=$(echo ${scontext} | grep -o 'StdOut=[^ ]*' | cut -d= -f2)
## ##
##############################################################
## Build a term-handler function to be executed ##
## when the job gets the SIGTERM ##
term_handler()
{
echo "Executing term handler at $(date)"
if [[ $restarts -lt $max_restarts ]];then
# Copy the log file because it will be overwritten
cp -v "${outfile}" "${outfile}.${restarts}"
scontrol requeue ${SLURM_JOB_ID}
exit 0
else
echo "Your job is over the Maximun restarts limit"
exit 1
fi
}
## Call the function when the job receives the SIGTERM ##
trap 'term_handler' SIGTERM
# print some job-information
cat <<EOF
SLURM_JOB_ID: $SLURM_JOB_ID
SLURM_JOB_NAME: $SLURM_JOB_NAME
SLURM_JOB_PARTITION: $SLURM_JOB_PARTITION
SLURM_SUBMIT_HOST: $SLURM_SUBMIT_HOST
Restarts: $restarts
EOF
## ##
##############################################################
## Here begins your actual program ##
## ##
## Modules to load ...
## srun application ...
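For example, the application section at the end of the template could be filled in with something like the following minimal sketch; the module and application names are placeholders for your own setup. Launching the program in the background and then calling wait allows bash to execute the SIGTERM trap while the application is still running:
## Modules to load (placeholder, adjust to your application)
module load <your_application_module>

## Launch the application in the background and wait for it,
## so that the SIGTERM trap can fire while it is still running
srun <your_application> <input_files> &
wait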
Use case
This use case involves phylogenetic analyses with very long run times. The user also has many completely independent cases to process, so they can be handled separately. The previous workflow was as follows:
Run a script that loops over the total number of cases, submitting an independent job for each one.
Each job reads its input from one line of a config.txt file and performs the analysis with a time limit of 42 days, so it goes to the ondemand partition/QOS.
The job keeps running until it reaches the specified time limit, regardless of whether the convergence criterion has been achieved or not.
At the end of the job, convergence is evaluated to decide whether the run was successful. If not, the run starts again from the beginning.
As you can see, this method gives no guarantee that computational resources are used efficiently, so we started looking for a new workflow. First of all, let's look at the tools needed.
System signaling
Every process can listen to system signals while it is running. When a job is about to die because of a time limit, a node failure, or some other system event, SLURM sends it a terminating signal, SIGTERM.
We use the command trap to decide what to do when a signal is caught. For example, trap "echo I've got a signal" SIGTERM echoes a message when a terminating signal is received, in other words, when the job is dying.
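As a quick illustration (not part of the template), you can try trap in a plain shell script outside SLURM: run it, send it a SIGTERM with kill, and watch the handler fire.
#!/bin/bash
# Minimal demonstration of trapping SIGTERM (run it locally, outside SLURM)
trap 'echo "Caught SIGTERM at $(date)"' SIGTERM
echo "My PID is $$; try: kill -SIGTERM $$"
sleep 600 &           # keep the script alive with a background process
wait                  # wait returns as soon as the trapped signal arrives
kill %1 2>/dev/null   # clean up the background sleep before exiting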
We build a term handler function to manage these signals more efficiently.
term_handler()
{
echo "Executing term handler at $(date)"
export num_lin # A variable that comes from the previous step
if [[ $restarts -lt $max_restarts ]]; then
# Copy the logfile because it will be overwritten by the 2nd run
cp -v "${outfile}" "${outfile}.${restarts}"
scontrol requeue ${SLURM_JOB_ID}
exit 0
else
echo "Your job is over the Max Restart Limit"
exit 1
fi
}
And further in that same script:
trap 'term_handler' SIGTERM
With this function, we check whether our job has exceeded the predefined maximum number of restarts (just as a sanity measure for the system) and, if it has not, we perform a SLURM requeue.
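If, as mentioned in the summary, you also want to use a convergence criterion to decide on the requeue, one possible variation of the handler is to test for a marker file that your application writes once it has converged. This is only a sketch: CONVERGED_FLAG is a hypothetical variable pointing to that file.
term_handler()
{
    echo "Executing term handler at $(date)"
    # CONVERGED_FLAG is a hypothetical marker file written by the application
    if [[ -f "${CONVERGED_FLAG}" ]]; then
        echo "Convergence reached, no requeue needed"
        exit 0
    elif [[ $restarts -lt $max_restarts ]]; then
        cp -v "${outfile}" "${outfile}.${restarts}"
        scontrol requeue ${SLURM_JOB_ID}
        exit 0
    else
        echo "Your job is over the maximum restarts limit"
        exit 1
    fi
}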
Note: we can control how long before the end of the job the signal is emitted with the sbatch directive #SBATCH --signal=B:SIGTERM@10. This directive tells SLURM to send a SIGTERM to the batch step (the shell running your submit script, which handles the running application) 10 seconds before the job dies. Obviously, you should adapt this time to fit your needs. For example, if your application can perform checkpoints on demand but a checkpoint takes 5 minutes to complete, #SBATCH --signal=B:SIGTERM@360 sends the signal 6 minutes before the time limit is reached.
SLURM Requeue
The scontrol requeue command is executed when the job is about to die. Requeueing has some advantages over submitting a new job, especially in terms of preserving priority. For accounting purposes it is also better to keep a single job, with a single job ID and its accumulated resource consumption.
For now, adding the requeue property to a job can only be done by the system administrators, so we can check that everything is fine with the rolling job before the feature is unlocked: are the application checkpoints well defined and working properly? Is the maximum number of restarts sensible?
Once this request is accepted you are almost done.
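Once the systems department has enabled it, you can verify that your job really has the requeue property, and see how many times it has already been requeued, by looking at the Requeue and Restarts fields reported by scontrol:
# Requeue=1 means the job can be requeued; Restarts counts previous requeues
scontrol show job <jobid> | grep -oE 'Requeue=[0-9]+|Restarts=[0-9]+'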
Application checkpointing
Your application must handle the checkpointing part itself: it needs to write checkpoints and to be able to restart from them.
In RevBayes, this is done by modifying part of your .Rev file. Let's see.
DATASET = "clade_29-Buteogallus"
# filepaths
fp = "/PATH/TO/THE/FILES"
in_fp = fp + "data/"
out_fp = fp + "/" + DATASET + "_output/"
check_fp = out_fp + DATASET + ".state"
...
...
# To restart from the checkpoint file uncomment next line
# mymcmc.initializeFromCheckpoint(check_fp)
# To write checkpoints to a checkpoint file. Adjust the checkpoint interval
mymcmc.run(generations=3000, tuningInterval=200, checkpointInterval=5, checkpointFile=check_fp)
As it is implemented now, there are two different .Rev files that act as a launcher and a continuator; this could probably be done within a single .Rev file. All of this is driven from a config.txt file with one line per case. A possible way to select between the two files automatically is sketched below.
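As a hypothetical sketch, the submit script could pick between the two files using the restart counter the template already extracts: the launcher on the first run, the continuator on every requeued run. The file names launcher.Rev and continuator.Rev are placeholders, and the RevBayes executable is assumed to be rb.
# First run: start from scratch; requeued runs: restart from the checkpoint
if [[ ${restarts} -eq 0 ]]; then
    rev_script="launcher.Rev"      # runs mymcmc.run(...) directly
else
    rev_script="continuator.Rev"   # calls mymcmc.initializeFromCheckpoint(check_fp) first
fi
srun rb "${rev_script}" &
wait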
Review of the workflow