.. _ft3_job_requeue:

Job Requeue
===========

How to split a never-ending job (e.g. RevBayes) into smaller batches to work around partition time limits and keep long runs progressing.

**Intro**

As mentioned elsewhere in this guide, the SLURM scheduler manages the resources. Partitions have different priority values to help balance resource requests, and your job is directed to one partition or another depending on the time limit and the type of node you ask for. These partitions also impose other limits, such as the number of jobs a user can have running or queued at the same time, or the total number of processors. You can check some of these limits with the command ``batchlim``. As you can see below, these differences are especially noticeable for jobs sent to the ondemand partitions.

.. code-block::

    Name         Priority   GrpTRES       MaxTRES       MaxWall      MaxJobsPU MaxTRESPU     MaxSubmit
    ------------ ---------- ------------- ------------- ------------ --------- ------------- ---------
    short        50                       cpu=2048                   50        cpu=2048      100
    medium       40                       cpu=2048                   30        cpu=2048      50
    long         30         cpu=8576      cpu=2048                   5         cpu=2048      10
    requeue      20                       cpu=2048                   5         cpu=2048      10
    ondemand     10         cpu=4288      cpu=1024                   2         cpu=1024      10
    ...
    clk_short    50                       node=1        06:00:00     200       cpu=960       400
    clk_medium   40                       node=1        3-00:00:00   200       cpu=960       250
    clk_long     30         cpu=1440      node=1        7-00:00:00   60        cpu=360       60
    clk_ondemand 10         cpu=720       node=1        42-00:00:00  20        cpu=240       20
    ...

This method combines system signaling, the SLURM requeue capability, and application checkpointing to work around these limits while preserving the balance of the system.

Summary and Generalization
--------------------------

To break your long job into smaller batches you need three main tools: catch the SIGTERM signal with the ``trap`` command in your submit script, requeue the job when that signal arrives, and manage the application checkpointing procedure. You must also ask the systems department to grant requeue capabilities to your job (this step is mandatory and cannot be done by the user!).

In this template the variable ``max_restarts`` is defined so that your job does not keep restarting forever without supervision, but you can define other criteria to decide whether to requeue or not, such as meeting a convergence criterion (see the sketch below).
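For instance, here is a minimal sketch of an alternative ``term_handler`` (the original is in the template below), assuming a hypothetical ``CONVERGED`` flag file that your application writes once its convergence criterion is met; the flag file name and the extra test are assumptions for illustration only:

.. code-block:: bash

    # Alternative term_handler sketch: stop requeueing once the application
    # has written the (hypothetical) CONVERGED flag file, or when the restart
    # limit is reached. It relies on the restarts, max_restarts and outfile
    # variables defined in the template below.
    term_handler() {
        echo "Executing term handler at $(date)"
        if [[ ${restarts} -lt ${max_restarts} && ! -f "CONVERGED" ]]; then
            cp -v "${outfile}" "${outfile}.${restarts}"   # keep the previous log
            scontrol requeue ${SLURM_JOB_ID}
            exit 0
        else
            echo "Finished: converged or over the maximum restarts limit"
            exit 1
        fi
    }
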
The next script is a general template that you can adapt to quickly get your job running. Adapt what you need and add your application lines underneath. If you want to see the whole process in a real-world example, go to the next section `Use case`_.

.. code-block:: bash

    #!/bin/bash
    ##############################################################
    ##          SBATCH directives, modify as needed             ##
    #SBATCH -t 24:00:00
    #SBATCH -C clk                 # Node type. Leave empty for ilk
    #SBATCH -J Rolling
    #SBATCH -o /PATH/TO/STDOUT.out
    #SBATCH -e /PATH/TO/STDOUT.err
    #SBATCH -N 1
    #SBATCH --ntasks-per-node=1
    #SBATCH -c 10
    #SBATCH --mem-per-cpu=3G
    #SBATCH --signal=B:SIGTERM@30

    ##############################################################
    ##  Gather some information from the job and set the limits ##
    max_restarts=4                 # tweak this number to fit your needs
    scontext=$(scontrol show job ${SLURM_JOB_ID})
    restarts=$(echo ${scontext} | grep -o 'Restarts=[0-9]*' | cut -d= -f2)
    outfile=$(echo ${scontext} | grep -o 'StdOut=[^ ]*' | cut -d= -f2)

    ##############################################################
    ##  Build a term_handler function to be executed            ##
    ##  when the job gets the SIGTERM                           ##
    term_handler() {
        echo "Executing term handler at $(date)"
        if [[ ${restarts} -lt ${max_restarts} ]]; then
            # Copy the log file because it will be overwritten
            cp -v "${outfile}" "${outfile}.${restarts}"
            scontrol requeue ${SLURM_JOB_ID}
            exit 0
        else
            echo "Your job is over the maximum restarts limit"
            exit 1
        fi
    }

    ## Call the function when the job receives the SIGTERM ##
    trap 'term_handler' SIGTERM

    # Print some job information
    cat << EOF
    SLURM job ID:    ${SLURM_JOB_ID}
    Restarts so far: ${restarts}
    Output file:     ${outfile}
    EOF

    ## Add your application lines underneath. Run the program in the
    ## background and "wait" for it, so that the trap can be executed
    ## as soon as the signal arrives (bash defers traps while a
    ## foreground command is running).
    # your_application &           # placeholder, replace with your command
    # wait

**System signaling**

The ``--signal`` directive asks SLURM to warn the job some time before the time limit. For example, ``#SBATCH --signal=B:SIGTERM@360`` sends a SIGTERM 6 minutes (360 seconds) before the time limit is reached; the ``B:`` prefix delivers the signal to the batch shell so that the ``trap`` above can catch it. Adjust the margin to the time your application needs to write its last checkpoint.

**SLURM Requeue**

The ``scontrol requeue`` command is executed when the job is about to die. It has some advantages over submitting a new job, especially in terms of priority conservation. For accounting purposes it is also better to keep a single job, with a unique job ID and its own resource consumption. For now, adding the requeue property to a job can only be done by the system administrators, so that we can check that everything is fine with the rolling job before the feature is unlocked: are the application checkpoints well defined and working properly? Is the maximum number of restarts well defined? Once this request is accepted, you are almost done.

**Application checkpointing**

Your application must handle the checkpointing part itself: it needs to write checkpoints periodically and to be able to restart from the last one. In RevBayes this is done by modifying some parts of your .Rev file. Let's see.

.. code-block::

    DATASET = "clade_29-Buteogallus"

    # filepaths
    fp       = "/PATH/TO/THE/FILES"
    in_fp    = fp + "data/"
    out_fp   = fp + "/" + DATASET + "_output/"
    check_fp = out_fp + DATASET + ".state"

    ...
    ...

    # To restart from the checkpoint file, uncomment the next line
    # mymcmc.initializeFromCheckpoint(check_fp)

    # To write checkpoints to a checkpoint file. Adjust the checkpoint interval
    mymcmc.run(generations=3000, tuningInterval=200, checkpointInterval=5, checkpointFile=check_fp)

The way it is implemented now uses two different .Rev files that act as a launcher and a continuator: the launcher starts a fresh run, while the continuator uncomments ``initializeFromCheckpoint`` to resume from the ``.state`` file. This could likely be done within a single .Rev file. All of this is driven from a config.txt file with separate lines for each case; a sketch of how the submit script could automate the choice is shown after the workflow figure below.

**Review of the workflow**

.. figure:: _static/screenshots/flujograma.png
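
As a complement to the checkpointing section, this is a minimal sketch of how the submit script could pick the right .Rev file automatically, reusing the ``restarts`` counter computed in the template above. The executable name ``rb`` and the file names ``launcher.Rev`` / ``continuator.Rev`` are assumptions for illustration, not part of the current setup:

.. code-block:: bash

    # First run (no restarts yet): start from scratch with the launcher.
    # Requeued runs: resume from the .state checkpoint with the continuator,
    # which enables mymcmc.initializeFromCheckpoint().
    if [[ ${restarts} -eq 0 ]]; then
        rb launcher.Rev &
    else
        rb continuator.Rev &
    fi
    wait   # keep the batch shell free so the SIGTERM trap can fire in time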