Job signals

When a job reaches its requested time limit, the queueing system kills it. Before that happens, we may want to clean up the working directory or save the files needed to restart the job, so that the work done up to that point is not lost.

Using the --signal option of the sbatch command, we can tell our job to perform such tasks before it reaches the time limit:

--signal=[B:]<sig_num>[@<sig_time>]

When a job is within sig_time seconds of its end time, Slurm sends it the signal sig_num. Due to the resolution of event handling by Slurm, the signal may be sent up to 60 seconds earlier than specified. sig_num may be either a signal number or a signal name (e.g. "10" or "USR1"). sig_time must be an integer between 0 and 65535. By default, no signal is sent before the job's end time. If sig_num is specified without sig_time, the default time is 60 seconds. Use the "B:" prefix to signal only the batch shell; none of the other processes will be signalled. Without it, all job steps are signalled, but not the batch shell itself.

#!/bin/bash
...
#SBATCH --signal=B:USR1@120

In this example, we ask Slurm to send a signal to our script 120 seconds before it times out to give us a chance to perform cleanup actions.

The first step is to add the --signal option, which asks Slurm to send a signal (USR1) to the job a given time (120 seconds) before it reaches the time limit. Keep in mind that, because of the way Slurm handles events, the signal may arrive up to 60 seconds earlier than requested. The requested time must also be long enough to complete all the tasks we want to run, because once the time limit is reached the job is killed regardless.
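To make the timing concrete, the following sketch works out when the signal can arrive. It assumes the 240-second time limit used in the example later on this page; the variable names are only illustrative.

```shell
#!/bin/bash
# Illustrative arithmetic only; variable names are hypothetical.
time_limit=240   # seconds requested for the job (assumed, as in this page's example)
sig_time=120     # from --signal=B:USR1@120
max_early=60     # Slurm may send the signal up to 60 seconds early

# seconds after the job starts at which the signal can arrive
earliest_signal=$(( time_limit - sig_time - max_early ))
latest_signal=$(( time_limit - sig_time ))

echo "signal arrives between ${earliest_signal}s and ${latest_signal}s after the job starts"
echo "time left for the handler: between ${sig_time}s and $(( sig_time + max_early ))s"
```

The safe assumption when sizing sig_time is the lower bound: the handler is only guaranteed sig_time seconds.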

Clean-up function

The next step is to define a function that carries out the tasks we need.

# define the handler function. Note that this is not executed here, but rather when the associated signal is sent
your_cleanup_function()
{
echo "function your_cleanup_function called at $(date)"
# do whatever cleanup you want here
}
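As an illustration of what the body of such a function might look like, here is a minimal sketch that saves restart files and removes temporary ones. The directories (created with mktemp so the sketch can be tried anywhere) and the *.chk / *.tmp naming are assumptions, not something this guide prescribes; replace them with whatever your job really uses.

```shell
#!/bin/bash
# Hypothetical locations: replace with the directories your job really uses.
WORK_DIR=$(mktemp -d)   # stand-in for the job's working directory
SAVE_DIR=$(mktemp -d)   # stand-in for a permanent directory

your_cleanup_function()
{
    echo "function your_cleanup_function called at $(date)"
    # keep the files needed to restart the job (hypothetical *.chk naming)
    cp "$WORK_DIR"/*.chk "$SAVE_DIR"/ 2>/dev/null
    # remove temporary files (hypothetical *.tmp naming)
    rm -f "$WORK_DIR"/*.tmp
}

# quick local check: create sample files and call the function by hand
touch "$WORK_DIR/state.chk" "$WORK_DIR/scratch.tmp"
your_cleanup_function
ls "$SAVE_DIR"
```

In a real job the function is not called by hand; it only runs when the signal arrives.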

Next, we tell the job to handle the signal indicated above with the function we have just defined.

# call your_cleanup_function once we receive USR1 signal

trap 'your_cleanup_function' USR1

Finally, we run the tasks of our job as usual, with one exception: we must add an "&" at the end of the main task or tasks and end the script with the wait command. Otherwise the signal will not be caught in time, because bash does not run a trap handler until the command running in the foreground has finished.

echo "starting calculation at $(date)"

# the calculation "computes" (in this case sleeps) for 1000 seconds
# but we asked Slurm only for 240 seconds so it will not finish
# the "&" after the compute step and "wait" are important

sleep 1000 &
wait
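The "&" plus wait pattern can be tried without submitting a job. The sketch below is a local simulation only: a background subshell plays the role of Slurm and sends USR1 to the script after one second. The short sleeps and the variables cleanup_called, calc_pid and wait_status are illustrative assumptions.

```shell
#!/bin/bash
# Local simulation, no Slurm involved.
cleanup_called=0

your_cleanup_function()
{
    echo "function your_cleanup_function called"
    cleanup_called=1
}

trap 'your_cleanup_function' USR1

# stand-in for Slurm: send USR1 to this script after 1 second
( sleep 1; kill -USR1 $$ ) &

# stand-in for the real calculation, started in the background
sleep 5 &
calc_pid=$!

# wait returns early, with a status greater than 128, when a trapped signal arrives
wait_status=0
wait "$calc_pid" || wait_status=$?
echo "wait returned with status $wait_status"

# the calculation is still running after the handler; stop it explicitly
kill "$calc_pid" 2>/dev/null || true
wait "$calc_pid" 2>/dev/null || true
```

Note that the handler does not stop the background task by itself. If the calculation should end once the cleanup is done, kill it explicitly as in the last lines of the sketch.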

Another signal that may be of interest is TERM, which lets us handle the cancellation of a job with the scancel command. The drawback is that, given the configuration of Slurm, we only have 30 seconds to carry out the desired tasks, so it is only suitable for small, very quick tasks. We do not recommend relying on this signal.

You can find an example script at: /opt/cesga/job-scripts-examples/job_signal_timeout.sh