Basic commands

This page helps users manage their Slurm jobs. It describes the basic commands and shows how to obtain job information, such as memory usage and CPU utilization, and how to use job statistics to troubleshoot job failures.

sbatch

Submits a script to a Slurm partition. The only mandatory parameters are the estimated time and the estimated memory per node/CPU. For example, to submit a script called script.sh with a time limit of 24 hours and 4 GB of memory: sbatch -t 24:00:00 --mem=4GB script.sh

If the submission succeeds, sbatch prints the job ID (<jobid>).
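In scripts it is often handy to capture the job ID for later use with squeue or sacct. The following is a minimal sketch, assuming sbatch prints its usual confirmation line ("Submitted batch job <jobid>"). The helper name jobid_from_sbatch is hypothetical and not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper: read sbatch's confirmation message on stdin
# ("Submitted batch job <jobid>") and print only the job ID.
jobid_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Submit only where sbatch and the script are actually available.
if command -v sbatch >/dev/null 2>&1 && [ -f script.sh ]; then
    jobid=$(sbatch -t 24:00:00 --mem=4GB script.sh | jobid_from_sbatch)
    echo "Job ID: ${jobid}"
fi
```

The captured value can then be passed to the commands described below, for example squeue -j "$jobid".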

srun

Commonly used to run a parallel task from within a script controlled by Slurm.
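As an illustration, a batch script might launch its parallel part through srun. This is only a sketch with assumed values (4 tasks, 10 minutes, 1 GB). run_step is a hypothetical wrapper that falls back to running the command directly when srun is not installed, so the script can also be tried outside the cluster.

```shell
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=1GB
#SBATCH --ntasks=4

# Hypothetical wrapper: use srun when it is available,
# otherwise run the command directly.
run_step() {
    if command -v srun >/dev/null 2>&1; then
        srun "$@"
    else
        "$@"
    fi
}

# Under Slurm, srun starts one copy of the command per task (4 here),
# so the node name is printed once for each task.
run_step uname -n
```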

sinfo

Displays information about SLURM nodes and partitions. It also provides information about:

  • Existing partitions (PARTITION)

  • Whether or not they are available (AVAIL)

  • The maximum run time of each partition (TIMELIMIT). If it is infinite, the limit is regulated externally

  • The nodes belonging to each partition (NODES)

  • Node state (STATE). The most common are:

    • idle: means available

    • alloc: means in use

    • mix: means some of the node's CPUs are in use and the rest are available

    • resv: means reserved for a specific use

    • drain: means temporarily removed for technical reasons

  • Information about a specific partition: sinfo -p <partitionname>

  • Refresh the information every 60 seconds: sinfo -i60

  • List the reasons why nodes are in the down, drained, fail or failing state: sinfo -R
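The columns listed above can also be post-processed. The sketch below totals the number of nodes per state. It assumes the five-column layout requested with -o (partition, availability, time limit, node count, state), and the function name nodes_by_state is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: sum node counts per state.
# Expects lines of "PARTITION AVAIL TIMELIMIT NODES STATE" on stdin.
# Nodes that belong to several partitions are counted once per partition.
nodes_by_state() {
    awk '{ n[$5] += $4 } END { for (s in n) print s, n[s] }' | sort
}

# -h drops the header line; -o selects the columns in the order above.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -h -o "%P %a %l %D %t" | nodes_by_state
fi
```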

squeue

Displays information about (your) jobs and their status in the Slurm scheduling queue.

  • State of a job with the jobid: squeue -j <jobid>

  • Report the expected start time and resources to be allocated for pending jobs in order of increasing start time: squeue --start

  • List all the running jobs: squeue -t RUNNING

  • List all the pending jobs: squeue -t PENDING

  • List the jobs requesting a specific partition: squeue -p <partitionname>

You can also see the full list of job states in the squeue manual page (JOB STATE CODES section).
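For a quick overview of your own queue, the job states can be tallied. A minimal sketch, assuming one state name per input line (as produced by the %T output format). The function name jobs_by_state is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count how many jobs are in each state.
# Expects one state name per line on stdin (e.g. RUNNING, PENDING).
jobs_by_state() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# -h drops the header; -u limits the list to one user; %T prints the job state.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$USER" -o "%T" | jobs_by_state
fi
```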

scancel

Signals or cancels jobs, job arrays or job steps.

  • Cancel a job: scancel <jobid>

  • Cancel all pending jobs: scancel -t PENDING

  • Cancel one or more jobs by name: scancel --name <jobname>

  • Cancel all jobs of a user: scancel -u <username>

scontrol

Returns detailed information about the nodes, partitions, job steps, and configuration. It is used for monitoring and modifying queued jobs.

  • Show detailed information about a job: scontrol show jobid -dd <jobid>

  • Write the batch script for a given job ID to a file or to stdout: scontrol write batch_script <jobid> -

  • Prevent a pending job from being started (without cancelling it): scontrol hold <jobid>

  • Release a previously held job to begin execution: scontrol release <jobid>

  • Requeue a running, suspended or finished Slurm batch job into pending state (equivalent to scancel + sbatch): scontrol requeue <jobid>
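The output of scontrol show job is a series of Key=Value pairs, so a single field can be picked out with standard tools. The sketch below is one way to do it. The function name scontrol_field is hypothetical, and it assumes the value itself contains no spaces.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one Key=Value field
# from "scontrol show job" output read on stdin.
# Assumes the value contains no spaces (true for fields such as JobState).
scontrol_field() {
    tr ' ' '\n' | awk -F= -v k="$1" '$1 == k { print substr($0, length(k) + 2); exit }'
}

# Example on a cluster (replace <jobid> with a real job ID):
#   scontrol show job <jobid> | scontrol_field JobState
```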

sacct

Displays accounting data for all jobs and job steps. This command is used for job monitoring.

  • Display accounting data for all jobs and job steps in the Slurm database: sacct

  • Show the accounting information of a specific job: sacct -j <jobid>

  • The flag -l shows all the fields: sacct -l

  • To show only specific fields: sacct --format=JobID,JobName,State,NTasks,NodeList,Elapsed,ReqMem,MaxRSSNod.

  • These fields are only an example; list all the available ones with sacct -e.
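When troubleshooting, it can help to filter the accounting output for jobs or steps that did not finish normally. A minimal sketch, assuming pipe-delimited input with JobID first and State second (as produced with -P and a matching --format list). The function name failed_steps is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list jobs and job steps whose state is neither
# COMPLETED, RUNNING nor PENDING.
# Expects "JobID|State|..." lines on stdin.
failed_steps() {
    awk -F'|' '$2 != "COMPLETED" && $2 != "RUNNING" && $2 != "PENDING" { print $1 ": " $2 }'
}

# Example on a cluster (-n drops the header, -P makes the output pipe-delimited):
#   sacct -j <jobid> -n -P --format=JobID,State,Elapsed,MaxRSS | failed_steps
```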

seff

This command reports the efficiency of a job that has completed and left the queue. If you run it while the job is still in the R (Running) state, it may report incorrect information. It shows the memory used, the percentage of the allocated memory that was utilized, CPU information and more.

  • To execute this command: seff <job-id>
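The memory percentage is simply the memory used divided by the memory requested. As a worked example of that arithmetic (not a reimplementation of seff), the sketch below uses a hypothetical function mem_efficiency and made-up values in MB.

```shell
#!/bin/bash
# Hypothetical helper: percentage of the requested memory that was used.
# Both arguments must be in the same unit (MB in this example).
mem_efficiency() {
    awk -v used="$1" -v req="$2" 'BEGIN {
        if (req <= 0) exit 1          # avoid dividing by zero
        printf "%.1f\n", 100 * used / req
    }'
}

# A job that requested 4096 MB and used 1024 MB: prints 25.0
mem_efficiency 1024 4096
```

A low value suggests the job could request less memory with --mem the next time it is submitted.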

sqstat

Detailed information about the queue system, resources consumption, status of all partitions and jobs: sqstat

For more detailed information, download the Slurm commands guide: <http://slurm.schedmd.com/pdfs/summary.pdf>