QOS
Different resource limits are associated with different QOS (Quality of Service) levels. A QOS is assigned automatically to a job depending on the resources it requests. To use a specific QOS when submitting a job, the option --qos=name_of_the_qos must be used.
Note
Except in very special cases, you should not specify the QOS, because the queuing system assigns it automatically. To avoid errors when submitting a job, do not use the --qos option unless necessary.
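As a minimal sketch of the advice in this note, the following Python helper assembles an sbatch command line and adds --qos only when a QOS name is passed explicitly. The function name build_sbatch_command and the script name job.sh are hypothetical and chosen only for illustration; the command is built as a list of arguments and is not executed.

```python
from typing import List, Optional


def build_sbatch_command(script: str, qos: Optional[str] = None) -> List[str]:
    """Return the argument list for submitting `script` with sbatch.

    The --qos option is added only when a QOS is requested explicitly;
    otherwise the queuing system is left to assign one automatically.
    """
    command = ["sbatch"]
    if qos:
        command.append(f"--qos={qos}")
    command.append(script)
    return command


if __name__ == "__main__":
    # Usual case: no QOS given, so --qos does not appear at all.
    print(" ".join(build_sbatch_command("job.sh")))
    # Special case: a specific QOS is requested (placeholder name).
    print(" ".join(build_sbatch_command("job.sh", qos="name_of_the_qos")))
```

Keeping --qos out of the default path means a submission script cannot accidentally pin a job to a QOS whose limits it does not meet.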
The limits associated with each QOS can change over time. To check the limits currently associated with your account, run the batchlim command as explained in partitions.
Name Priority GrpTRES MaxTRES MaxWall MaxJobsPU MaxTRESPU MaxSubmit
------------ ---------- ------------- ------------- ----------- --------- ------------- ---------
short 50 cpu=2048 50 cpu=2048 100
medium 40 cpu=2048 30 cpu=2048 50
long 30 cpu=8576 cpu=2048 5 cpu=2048 10
requeue 20 cpu=2048 5 cpu=2048 10
ondemand 10 cpu=4288 cpu=1024 2 cpu=1024 10
viz 50 2 cpu=64,gres/+ 2
class_a 200 cpu=4096 3-00:00:00 40 cpu=8192 100
class_b 40 cpu=4096 2-00:00:00 3 cpu=4096 15
class_c 10 cpu=4096 1-00:00:00 1 cpu=4096 5
special 30 cpu=16384 2 cpu=16384 10
clk_short 50 node=1 06:00:00 200 cpu=960 400
clk_medium 40 node=1 3-00:00:00 200 cpu=960 250
clk_long 30 cpu=1440 node=1 7-00:00:00 60 cpu=360 60
clk_ondemand 10 cpu=720 node=1 42-00:00:00 20 cpu=240 20
epyc_short 50 node=1 06:00:00 100 cpu=512 200
epyc_medium 40 node=1 3-00:00:00 100 cpu=512 200
ligo 200 node=1 2-00:00:00
otoom 50
- Brief explanation:
Name: the name given to the QOS.
Priority: the priority of a job is the sum of several factors, one of which is this QOS priority. The higher the value, the sooner a pending job will start running.
GrpTRES: maximum number of CPUs that can be in use at the same time by all the users of the QOS taken together.
MaxTRES: maximum resources that a single job can request in the QOS.
MaxWall: maximum time limit (TimeLimit) that a job can request. Members of the Systems department can extend these time limits if necessary.
MaxJobsPU: maximum number of jobs a user can have running at the same time in the QOS.
MaxTRESPU: maximum number of CPUs a user can request in the QOS.
MaxSubmit: total number of jobs a user can have in the queue, counting both running jobs and jobs waiting for resources (Pending).
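The MaxWall values in the table use the forms HH:MM:SS and D-HH:MM:SS. As an illustration, the sketch below converts such strings to seconds and checks whether a requested time limit fits within a given MaxWall. The function names are hypothetical, and only these two forms are handled; SLURM accepts other time formats that this sketch does not cover.

```python
def walltime_to_seconds(value: str) -> int:
    """Convert a time limit written as HH:MM:SS or D-HH:MM:SS to seconds."""
    days = 0
    if "-" in value:
        # Split off the leading day count, e.g. "3-00:00:00" -> 3 days.
        day_part, value = value.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def fits_max_wall(requested: str, max_wall: str) -> bool:
    """Return True if the requested time limit does not exceed MaxWall."""
    return walltime_to_seconds(requested) <= walltime_to_seconds(max_wall)


if __name__ == "__main__":
    print(walltime_to_seconds("06:00:00"))           # 21600
    print(walltime_to_seconds("3-00:00:00"))         # 259200
    print(fits_max_wall("12:00:00", "06:00:00"))     # False
    print(fits_max_wall("2-12:00:00", "3-00:00:00")) # True
```

The limits themselves should be taken from the current output of batchlim rather than hard-coded, since they can change over time.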
User Account Limits
These limits show up as messages in the NODELIST(REASON) column of the queue listing (for example, the output of squeue) when a job is launched. It should be emphasized that this is NOT an error: the job will start executing automatically as soon as the resources it requested become available. Some common reasons and their explanations are the following:
Priority: as explained above, SLURM relies on partitions, QOS limits and priorities to set the order in which jobs are executed. When this message appears in NODELIST(REASON), it is usually because demand for resources is high and it will take some time for them to be released.
Dependency: appears when the start of one job is contingent on the completion of another.
Resources: the system is waiting for the requested resources to be released before it can start running the job.
QOSMaxJobsPerUserLimit: the maximum number of jobs that a user can run simultaneously has been reached.
QOSMaxCpuPerUserLimit: the maximum number of CPUs that a user can use and request simultaneously has been reached.
Assoc-Group Limits
Apart from the limits of your user account, there are also limits for the group or association to which your account belongs. The following messages can also appear in NODELIST(REASON), and they are not errors either. The most common messages are:
AssocGrpCpuLimit: the maximum number of CPUs granted to the association or group is already in use. The jobs will remain in the queue until resources are released. Keep in mind that requesting many nodes, or exclusive use of them, can lengthen the wait.
AssocGrpJobsLimit: same as the previous case, but the limit that has been reached is the number of jobs.
AssocGrpGRES: similar to AssocGrpCpuLimit, but in this case the limit refers to GPUs.
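The reasons described in the two sections above can be collected in a small lookup table, for instance to annotate a list of pending jobs. The sketch below is only an illustration: the function name describe_reason and the three scope labels (scheduling, user limit, association limit) are assumptions made here for grouping, and the table covers only the reasons listed on this page, not every reason SLURM can report.

```python
from typing import Dict, Tuple

# Reason -> (scope, short explanation). The scope labels are an
# illustrative grouping, not SLURM terminology.
PENDING_REASONS: Dict[str, Tuple[str, str]] = {
    "Priority": ("scheduling", "high demand; waiting for resources to be released"),
    "Dependency": ("scheduling", "waiting for another job to complete"),
    "Resources": ("scheduling", "waiting for the requested resources to be released"),
    "QOSMaxJobsPerUserLimit": ("user limit", "maximum simultaneous jobs per user reached"),
    "QOSMaxCpuPerUserLimit": ("user limit", "maximum simultaneous CPUs per user reached"),
    "AssocGrpCpuLimit": ("association limit", "all CPUs granted to the group are in use"),
    "AssocGrpJobsLimit": ("association limit", "job limit of the group reached"),
    "AssocGrpGRES": ("association limit", "GPU limit of the group reached"),
}


def describe_reason(reason: str) -> Tuple[str, str]:
    """Return (scope, explanation) for a reason, or a fallback if unknown."""
    return PENDING_REASONS.get(reason, ("unknown", "reason not covered on this page"))


if __name__ == "__main__":
    for name in ("Priority", "QOSMaxCpuPerUserLimit", "AssocGrpGRES"):
        scope, text = describe_reason(name)
        print(f"{name}: [{scope}] {text}")
```

None of the reasons in this table indicates an error; in every case the job stays in the queue and starts once the corresponding limit or resource is freed.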