Basic commands
This page helps users manage their Slurm jobs by showing how to obtain detailed job information, such as memory usage and CPU utilization, and how to use job statistics to troubleshoot job failures.
sbatch
Submits a script to a Slurm partition. The only mandatory parameters are the estimated time and the estimated memory per node/CPU. For example, to submit a script called script.sh with a duration of 24 hours and 4 GB of memory:
sbatch -t 24:00:00 --mem=4GB script.sh
If the command executes successfully, it returns the job ID (<jobid>).
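Options can also be placed inside the script itself as #SBATCH directives. A minimal sketch of such a script (the job name and the executable are illustrative placeholders):
#!/bin/bash
# job name (illustrative)
#SBATCH -J myjob
# estimated time and memory, as in the example above
#SBATCH -t 24:00:00
#SBATCH --mem=4GB
# ./my_program stands for your own executable
srun ./my_program
The script is then submitted with: sbatch script.sh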
srun
Commonly used to launch parallel tasks (job steps), either interactively or from within a batch script controlled by Slurm.
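For example, a sketch of running a program on 4 parallel tasks (./my_program is a placeholder for your own executable):
srun -n 4 -t 00:30:00 --mem=2GB ./my_program
Inside a batch script, srun is usually called without time or memory options, since the job step inherits them from the sbatch allocation.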
sinfo
Displays information about Slurm nodes and partitions, including:
Existing partitions (PARTITION)
Whether or not they are available (AVAIL)
The maximum run time of each partition (TIMELIMIT; if it is infinite, the limit is regulated externally)
The nodes belonging to each partition (NODES)
Node states (STATE); the most common are:
idle: the node is available
alloc: the node is fully in use
mix: the node is partially allocated; some of its CPUs are still available
resv: the node is reserved for a specific use
drain: the node has been temporarily removed from service for technical reasons
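To list the state of each node individually (node-oriented, long format):
sinfo -N -l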
Information about a specific partition:
sinfo -p <partitionname>
Information every 60 seconds:
sinfo -i60
List reasons nodes are in the down, drained, fail or failing state:
sinfo -R
squeue
Displays information about (your) jobs and their status in the Slurm scheduling queue.
State of a job with the jobid:
squeue -j <jobid>
Report the expected start time and resources to be allocated for pending jobs in order of increasing start time:
squeue --start
List all the running jobs:
squeue -t RUNNING
List all the pending jobs:
squeue -t PENDING
List the jobs requesting a specific partition:
squeue -p <partition name>
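List only your own jobs:
squeue -u <username>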
The full list of job state codes is available in the squeue documentation.
scancel
Used to signal or cancel jobs, job arrays, or job steps.
Cancel a job:
scancel <jobid>
Cancel all of your pending jobs:
scancel -t PENDING
Cancel one or more jobs by job name:
scancel --name <jobname>
Cancel all jobs belonging to a user:
scancel -u <username>
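These filters can be combined; for example, to cancel only your pending jobs in a specific partition:
scancel -u <username> -t PENDING -p <partitionname>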
scontrol
Returns detailed information about nodes, partitions, jobs, job steps, and configuration. It is used for monitoring and modifying queued jobs.
Show detailed information about a job:
scontrol show jobid -dd <jobid>
Write the batch script for a given <jobid> to stdout (replace the final '-' with a filename to write it to a file instead):
scontrol write batch_script <jobid> -
Prevent a pending job from being started (without cancelling it):
scontrol hold <jobid>
Release a previously held job to begin execution:
scontrol release <jobid>
Requeue a running, suspended or finished Slurm batch job into pending state (equivalent to scancel + sbatch):
scontrol requeue <jobid>
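Queued jobs can also be modified with scontrol update; for example, a sketch of changing the time limit of a pending job (the new limit is illustrative, and increasing it normally requires administrator privileges):
scontrol update JobId=<jobid> TimeLimit=48:00:00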
sacct
Displays accounting data for all jobs and job steps. This command is used for job monitoring.
Display accounting data for all jobs and job steps in the Slurm database:
sacct
Show detailed accounting information for a specific job:
sacct -j <jobid>
The flag -l shows all the fields:
sacct -l
To show only specific fields:
sacct --format=JobID,JobName,State,NTasks,NodeList,Elapsed,ReqMem,MaxRSSNode
Those fields are just an example; you can list all the available fields with:
sacct -e
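You can also restrict the query to a time window with -S (start time) and -E (end time); the dates below are illustrative:
sacct -S 2024-05-01 -E 2024-05-31 --format=JobID,JobName,State,Elapsed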
seff
This command is useful for finding the efficiency of a job that has completed and exited the queue. If you run it while the job is still in the R (RUNNING) state, it may report incorrect information. It shows the memory used, the percentage of the allocated memory that was utilized, CPU usage information, and more.
To execute this command:
seff <jobid>
sqstat
Shows detailed information about the queue system, resource consumption, and the status of all partitions and jobs:
sqstat
For more detailed information, download the Slurm commands guide: http://slurm.schedmd.com/pdfs/summary.pdf