(ft3_job_requeue)=
% :title: Job Requeue
% :section: ft3-user-guide
% :url_source: https://cesga-docs.gitlab.io/ft3-user-guide/job_requeue.html
% :topic: job_requeue

# Job Requeue

How to split a never-ending job (e.g. RevBayes) into smaller batches to get around partition time limits and make better use of the system.

**Intro**

As mentioned in this guide, the SLURM scheduler manages the resources. There are partitions with different priority values that help balance resource requests. Your job is directed to one partition or another depending on the time limit and the type of node you request. These partitions also impose other limits, such as the number of jobs running or queued per user at the same time, or the total number of processors. You can see some of these limits with the command `batchlim`. As you can see below, these differences are especially noticeable for jobs sent to the ondemand partition.

```
        Name   Priority       GrpTRES       MaxTRES     MaxWall MaxJobsPU     MaxTRESPU MaxSubmit
------------ ---------- ------------- ------------- ----------- --------- ------------- ---------
       short         50                    cpu=2048                    50      cpu=2048       100
      medium         40                    cpu=2048                    30      cpu=2048        50
        long         30      cpu=8576      cpu=2048                     5      cpu=2048        10
     requeue         20                    cpu=2048                     5      cpu=2048        10
    ondemand         10      cpu=4288      cpu=1024                     2      cpu=1024        10
         ...
   clk_short         50                      node=1    06:00:00       200       cpu=960       400
  clk_medium         40                      node=1  3-00:00:00       200       cpu=960       250
    clk_long         30      cpu=1440        node=1  7-00:00:00        60       cpu=360        60
clk_ondemand         10       cpu=720        node=1 42-00:00:00        20       cpu=240        20
         ...
```

This method combines system signaling, the SLURM requeue capability, and application checkpointing to work around this limitation while preserving the balance of the system.

## Summary and Generalization

To break your long job into smaller batches you need three main ingredients:

- catch the SIGTERM signal with the `trap` command in your submit script,
- requeue the job when the signal is received (you must ask the systems department to grant requeue capability to your job; this step is mandatory and cannot be done by the user), and
- manage the application checkpointing procedure.

For this template, the variable `max_restarts` is defined so that your job does not requeue forever without supervision, but you can also define other criteria to decide whether to requeue, such as meeting a convergence criterion.

The next script is a general template that you can adapt to get your job running quickly. Adapt what you need and add your application lines at the bottom. If you want to see the process in a real-world example, go to the next section, [Use case].

```bash
#!/bin/bash
##############################################################
##        SBATCH directives, modify as needed               ##
#SBATCH -t 24:00:00
#SBATCH -C clk                  # Node type. Leave empty for ilk
#SBATCH -J Rolling
#SBATCH -o /PATH/TO/STDOUT.out
#SBATCH -e /PATH/TO/STDOUT.err
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -c 10
#SBATCH --mem-per-cpu=3G
#SBATCH --signal=B:SIGTERM@30

##############################################################
##  Gather some information from the job and set the limits ##
max_restarts=4   # tweak this number to fit your needs
scontext=$(scontrol show job ${SLURM_JOB_ID})
restarts=$(echo ${scontext} | grep -o 'Restarts=[0-9]*' | cut -d= -f2)
outfile=$(echo ${scontext} | grep -o 'StdOut=[^ ]*' | cut -d= -f2)

##############################################################
##  Build a term-handler function to be executed            ##
##  when the job gets the SIGTERM                           ##
term_handler()
{
    echo "Executing term handler at $(date)"
    if [[ $restarts -lt $max_restarts ]]; then
        # Copy the log file because it will be overwritten on requeue
        cp -v "${outfile}" "${outfile}.${restarts}"
        scontrol requeue ${SLURM_JOB_ID}
        exit 0
    else
        echo "Your job is over the maximum restarts limit"
        exit 1
    fi
}

## Call the function when the job receives the SIGTERM      ##
trap 'term_handler' SIGTERM

# Print some job information
cat <<EOF
SLURM_JOB_ID:        $SLURM_JOB_ID
SLURM_JOB_NAME:      $SLURM_JOB_NAME
SLURM_JOB_PARTITION: $SLURM_JOB_PARTITION
SLURM_SUBMIT_HOST:   $SLURM_SUBMIT_HOST
Restarts:            $restarts
EOF

##############################################################
##  Here begins your actual program                         ##

## Modules to load ...

## Launch your application; running it in the background and waiting on it
## lets bash execute the trap as soon as the SIGTERM arrives
srun application ... &
wait
```
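Once the systems department has granted the requeue capability, a quick way to verify that the template behaves as expected is to submit it and watch the job's requeue flag and restart counter. A short check, assuming the script above is saved as `rolling.sh` (the file name and the job ID 123456 are placeholders):

```bash
# Submit the template
sbatch rolling.sh

# Check that the job is requeueable and how many times it has restarted;
# replace 123456 with the job ID returned by sbatch
scontrol show job 123456 | grep -oE 'Requeue=[0-9]+|Restarts=[0-9]+'
```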
## Use case

This use case performs phylogenetic analyses in time-demanding runs. The user also has many completely independent cases to process, so they can be run separately.

The prior workflow was as follows:

- Run a script that loops over the total number of cases, submitting an independent job for each one.
- Each job reads its input from a line of a config.txt file and performs the analysis with a time limit of 42 days, so it goes to the ondemand partition/QOS.
- The job keeps running until it reaches the specified time limit, whether or not the convergence criterion has been met.
- At the end of the job, convergence is evaluated to decide whether the run was successful. If not, the run starts again from the beginning.

As you can see, this method gives no guarantee of efficient use of computational resources, so we looked for a new workflow. First of all, let's look at the tools needed.

**System signaling**

Every process can listen to system signals while it is running. When a job is about to die because of a time limit, a node failure, or some other system event, SLURM sends it a terminating signal, SIGTERM. We use the command `trap` to decide what to do when a signal is caught. For example, `trap "echo I've got a signal" SIGTERM` echoes a message when the terminating signal arrives, in other words, when the job is dying. We build a term-handler function to manage these signals more conveniently.

```
term_handler()
{
    echo "Executing term handler at $(date)"
    export num_lin   # A variable that comes from the previous step
    if [[ $restarts -lt $max_restarts ]]; then
        # Copy the log file because it will be overwritten by the 2nd run
        cp -v "${outfile}" "${outfile}.${restarts}"
        scontrol requeue ${SLURM_JOB_ID}
        exit 0
    else
        echo "Your job is over the max restarts limit"
        exit 1
    fi
}
```

And further on in the same script:

```
trap 'term_handler' SIGTERM
```

With this function, we check whether our job has exceeded the prefixed maximum number of restarts, just for the sanity of the system, and if it has not, we do a SLURM requeue.
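If you do not want to wait for the time limit to see the handler in action, you can deliver the same signal by hand. A small test, assuming the job is already running and 123456 is a placeholder for its job ID:

```bash
# Send SIGTERM to the batch shell only, mimicking what
# '#SBATCH --signal=B:SIGTERM@30' does shortly before the time limit
scancel --signal=TERM --batch 123456

# If the handler ran, the previous log was saved with the restart number
# appended (e.g. STDOUT.out.0) and the job went back to the queue
ls /PATH/TO/STDOUT.out*
squeue -j 123456
```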
**Note**: we can control how long before the end of the job the signal is emitted with the sbatch directive `#SBATCH --signal=B:SIGTERM@10`. This directive tells SLURM to send a SIGTERM to the .batch step, which handles the running application, 10 seconds before the job dies. Obviously, you should adapt this time to fit your needs. E.g. if your application can perform checkpoints on demand but takes 5 minutes to complete one, `#SBATCH --signal=B:SIGTERM@360` sends the signal 6 minutes before the time limit is reached.

**SLURM Requeue**

The SLURM requeue command is executed when the job is about to die. It has some advantages over submitting a new job, especially in terms of priority conservation. For accounting it is also better to keep a single job, with a unique job ID and its resource consumption. For now, adding the requeue property to a job is only available to system administrators, so we can check that everything is okay with the rolling job before the feature is unlocked: are the application checkpoints well defined and working properly? Is the maximum number of restarts well defined? Once this request is accepted, you are almost done.

**Application checkpointing**

Your application must handle the checkpointing part itself: it needs to write checkpoints and to be able to restart from them. In RevBayes, this is done by modifying part of your .Rev file. Let's see.

```
DATASET = "clade_29-Buteogallus"

# filepaths
fp       = "/PATH/TO/THE/FILES"
in_fp    = fp + "data/"
out_fp   = fp + "/" + DATASET + "_output/"
check_fp = out_fp + DATASET + ".state"

...
...

# To restart from the checkpoint file, uncomment the next line
# mymcmc.initializeFromCheckpoint(check_fp)

# To write checkpoints to a checkpoint file. Adjust the checkpoint interval
mymcmc.run(generations=3000, tuningInterval=200, checkpointInterval=5, checkpointFile=check_fp)
```

Right now this is implemented with two different .Rev files that act as a launcher and a continuator, although it could probably be done in a single .Rev file. All of this is driven from a config.txt file with a separate line for each case. A sketch of how the job script could choose between the two files is shown after the workflow review below.

**Review of the workflow**

:::{figure} _static/screenshots/flujograma.png
:::
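As a closing example, here is a minimal sketch (not the official template) of how the tail of the requeue job script could select the RevBayes input depending on the restart counter gathered earlier. The file names `launcher.Rev` and `continuator.Rev` are hypothetical, and the sketch assumes RevBayes is available as the `rb` executable.

```bash
# Pick the .Rev script depending on whether this is the first run or a
# requeued one. 'launcher.Rev' starts a fresh MCMC and writes checkpoints;
# 'continuator.Rev' calls initializeFromCheckpoint() before mymcmc.run().
if [[ ${restarts} -eq 0 ]]; then
    rev_script="launcher.Rev"
else
    rev_script="continuator.Rev"
fi

# Run in the background and wait so the SIGTERM trap can fire in time
srun rb "${rev_script}" &
wait
```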