QOS

Different resource limits are associated with different QOSs (Qualities of Service). A QOS is assigned automatically to a job depending on the requested resources. To use a specific QOS when submitting a job, add the option --qos=name_of_the_qos.

Note

Except in very special cases, you should not specify the QOS: the queuing system assigns it automatically. Therefore, to avoid errors when submitting a job, do not use the --qos option unless it is really necessary.
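If you do need to request a QOS explicitly, the option can go either on the sbatch command line or inside the job script. A minimal sketch follows; the job name, resources, QOS name and program are illustrative, not a recommendation for this system:

```shell
#!/bin/bash
#SBATCH --job-name=example        # illustrative job name
#SBATCH --ntasks=64               # illustrative resource request
#SBATCH --time=10:00:00           # must fit within the MaxWall of the QOS
#SBATCH --qos=long                # only when a specific QOS is really needed

srun ./my_program                 # hypothetical executable
```

Equivalently, the option can be given at submission time: sbatch --qos=long job.sh.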

The limits associated with the different QOSs can change over time, so to check the current limits associated with your account you can run the batchlim command as explained in partitions.

Name           Priority       GrpTRES       MaxTRES     MaxWall MaxJobsPU     MaxTRESPU MaxSubmit
------------ ---------- ------------- ------------- ----------- --------- ------------- ---------
short                50                    cpu=2048                    50      cpu=2048       100
medium               40                    cpu=2048                    30      cpu=2048        50
long                 30      cpu=8576      cpu=2048                     5      cpu=2048        10
requeue              20                    cpu=2048                     5      cpu=2048        10
ondemand             10      cpu=4288      cpu=1024                     2      cpu=1024        10
viz                  50                                                 2 cpu=64,gres/+         2
class_a             200                    cpu=4096  3-00:00:00        40      cpu=8192       100
class_b              40                    cpu=4096  2-00:00:00         3      cpu=4096        15
class_c              10                    cpu=4096  1-00:00:00         1      cpu=4096         5
special              30                   cpu=16384                     2     cpu=16384        10
clk_short            50                      node=1    06:00:00       200       cpu=960       400
clk_medium           40                      node=1  3-00:00:00       200       cpu=960       250
clk_long             30      cpu=1440        node=1  7-00:00:00        60       cpu=360        60
clk_ondemand         10       cpu=720        node=1 42-00:00:00        20       cpu=240        20
epyc_short           50                      node=1    06:00:00       100       cpu=512       200
epyc_medium          40                      node=1  3-00:00:00       100       cpu=512       200
ligo                200                      node=1  2-00:00:00
otoom                50
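As an illustration of how to read this output programmatically, the following sketch extracts the MaxWall column for one QOS. A pasted excerpt (columns: Name, Priority, MaxTRES, MaxWall) stands in for real batchlim output:

```shell
#!/bin/sh
# Illustrative only: a pasted excerpt of the QOS table above
# (columns: Name, Priority, MaxTRES, MaxWall).
qos_excerpt='clk_short  50 node=1 06:00:00
clk_medium 40 node=1 3-00:00:00
clk_long   30 node=1 7-00:00:00'

# Print the MaxWall limit (4th column) of the clk_long QOS.
printf '%s\n' "$qos_excerpt" | awk '$1 == "clk_long" { print $4 }'
# prints 7-00:00:00
```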

Brief explanation:

  • Name: the name given to that QOS.

  • Priority: the priority of a job is the sum of different variables, including one actually called priority. The higher it is, the sooner the job will start running.

  • GrpTRES: maximum number of CPUs that can be in use at the same time among all members of a group.

  • MaxTRES: maximum resources that can be requested in a QOS.

  • MaxWall: maximum TimeLimit that can be requested for a job. The Systems department can extend these time limits if necessary.

  • MaxJobsPU: maximum number of jobs a user can have running at the same time in that QOS.

  • MaxTRESPU: maximum number of CPUs a user can request in a QOS.

  • MaxSubmit: total number of jobs a user can have in the queue, both running and waiting for resources (pending).

User Account Limits

These reasons appear in the NODELIST(REASON) column (for example, in the output of squeue) while a job is waiting. It should be emphasized that this is NOT an error: as soon as the resources requested at submission become available, the job will start running automatically. Some reasons that usually appear, and their explanations, are the following:

  • Priority: as explained above, the SLURM system relies on partitions, QOS limits and priorities to set the order in which jobs are executed. When this message appears in NODELIST(REASON), it usually means there is high demand for resources and it will take some time for them to be released.

  • Dependency: appears when the start of one job is contingent on the completion of another.

  • Resources: the system is waiting for the requested resources to be released before it can start running the job.

  • QOSMaxJobsPerUserLimit: indicates that the maximum number of jobs that a user can launch simultaneously has been reached.

  • QOSMaxCpuPerUserLimit: indicates that the maximum number of CPUs that a user can use and request simultaneously has been reached.
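On a real system you can see these reasons with squeue, e.g. squeue -u $USER --states=PENDING -o "%i %r". As a runnable sketch, the snippet below summarises how many pending jobs are waiting for each reason; a pasted sample (job id, reason) stands in for real squeue output:

```shell
#!/bin/sh
# Illustrative: the sample below stands in for the output of
#   squeue -u $USER --states=PENDING -o "%i %r"
sample='1001 Priority
1002 Resources
1003 QOSMaxJobsPerUserLimit
1004 Priority'

# Count pending jobs per reason.
printf '%s\n' "$sample" | awk '{ n[$2]++ } END { for (r in n) print r, n[r] }' | sort
# prints:
# Priority 2
# QOSMaxJobsPerUserLimit 1
# Resources 1
```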

Assoc-Group Limits

Keep in mind that, apart from the limits on your user account, there are also limits on the group or association to which your account belongs. The following messages can also appear in NODELIST(REASON), and they are not errors either. The most common ones are:

  • AssocGrpCpuLimit: all the CPUs granted to the association or group are in use. Jobs will remain in the queue until resources are released. Keep in mind that if you request many nodes, or exclusive use of them, the waiting time can increase.

  • AssocGrpJobsLimit: same explanation as in the previous case, but here the group's limit on the number of jobs has been reached.

  • AssocGrpGRES: similar to AssocGrpCpuLimit, but in this case the limit refers to GPUs (GRES).
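To make the group limit concrete, here is a small arithmetic sketch using the GrpTRES value of the long QOS from the table above (cpu=8576); the in-use and requested figures are made up for illustration:

```shell
#!/bin/sh
# Illustrative: GrpTRES of the long QOS caps the whole group at 8576 CPUs.
# The in-use and requested values below are hypothetical.
group_limit=8576
in_use=6144
request=4096

free=$((group_limit - in_use))
echo "free CPUs for the group: $free"
# prints: free CPUs for the group: 2432

if [ "$request" -gt "$free" ]; then
  # The job stays pending with reason AssocGrpCpuLimit until
  # enough of the group's CPUs are released.
  echo "job would wait: AssocGrpCpuLimit"
fi
```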