Different resource limits are associated with different QOS (Quality of Service) levels. A QOS is assigned automatically to a job depending on the resources it requests. To use a special QOS when submitting a job, the option
--qos=name_of_the_qos must be used.
Except in very special cases, you should not specify the QOS because it will be assigned automatically by the queuing system. Therefore, to avoid errors when submitting a job, do not use the option
--qos unless necessary.
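As a minimal sketch of the advice above, the following Python helper assembles an sbatch command line and adds --qos only when a QOS is passed explicitly. The function name build_sbatch_command and the script name job.sh are hypothetical, chosen for illustration; only sbatch and its --qos option come from the queuing system itself.

```python
from typing import List, Optional


def build_sbatch_command(script: str, qos: Optional[str] = None) -> List[str]:
    """Return the argument list for submitting `script` with sbatch.

    Hypothetical helper: --qos is added only when the caller asks for it,
    so the queuing system can otherwise assign the QOS automatically.
    """
    command = ["sbatch"]
    if qos is not None:
        command.append(f"--qos={qos}")
    command.append(script)
    return command


if __name__ == "__main__":
    # Usual case: no QOS given, the scheduler picks one.
    print(" ".join(build_sbatch_command("job.sh")))
    # Special case: request a specific QOS by name.
    print(" ".join(build_sbatch_command("job.sh", qos="class_a")))
```

Keeping the QOS as an optional argument makes the default path the one recommended here: no --qos at all.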
The limits associated with the different QOS can change over time, so to check the current limits for your account you can run the
batchlim command as explained in partitions.
Name         Priority   GrpTRES       MaxTRES       MaxWall     MaxJobsPU MaxTRESPU     MaxSubmit
------------ ---------- ------------- ------------- ----------- --------- ------------- ---------
short        50                       cpu=2048                  50        cpu=2048      100
medium       40                       cpu=2048                  30        cpu=2048      50
long         30         cpu=8576      cpu=2048                  5         cpu=2048      10
requeue      20                       cpu=2048                  5         cpu=2048      10
ondemand     10         cpu=4288      cpu=1024                  2         cpu=1024      10
viz          50                                                 2         cpu=64,gres/+ 2
class_a      200                      cpu=4096      3-00:00:00  40        cpu=8192      100
class_b      40                       cpu=4096      2-00:00:00  3         cpu=4096      15
class_c      10                       cpu=4096      1-00:00:00  1         cpu=4096      5
special      30                       cpu=16384                 2         cpu=16384     10
clk_short    50                       node=1        06:00:00    200       cpu=960       400
clk_medium   40                       node=1        3-00:00:00  200       cpu=960       250
clk_long     30         cpu=1440      node=1        7-00:00:00  60        cpu=360       60
clk_ondemand 10         cpu=720       node=1        42-00:00:00 20        cpu=240       20
epyc_short   50                       node=1        06:00:00    100       cpu=512       200
epyc_medium  40                       node=1        3-00:00:00  100       cpu=512       200
ligo         200                      node=1        2-00:00:00
otoom        50
- Brief explanation:
Name: the name given to the QOS.
Priority: the priority contributed by the QOS. The overall priority of a job is the sum of several factors, one of which is this value. The higher the priority, the sooner a pending job will start running.
GrpTRES: maximum number of CPUs that can be in use at the same time by all members of a group combined.
MaxTRES: maximum resources that a single job can request in a QOS.
MaxWall: maximum TimeLimit that can be requested for a job. These time limits can be extended by the members of the System’s department if necessary.
MaxJobsPU: maximum number of jobs a user can have running at the same time in that QOS.
MaxTRESPU: maximum number of CPUs a user can request in a QOS.
MaxSubmit: total number of jobs that a user can have in the queue, counting both running jobs and jobs waiting for resources (Pending).
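To make the MaxWall column concrete, the sketch below converts a time limit written as D-HH:MM:SS or HH:MM:SS (the two forms that appear in the table) into seconds and checks a requested time against a QOS limit. The function names are hypothetical, and any other time format the queuing system may accept is not handled here.

```python
def wall_to_seconds(limit: str) -> int:
    """Convert 'D-HH:MM:SS' or 'HH:MM:SS' to seconds.

    Assumption: only the two formats shown in the QOS table are supported.
    """
    days = 0
    clock = limit
    if "-" in limit:
        day_part, clock = limit.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def fits_max_wall(requested: str, max_wall: str) -> bool:
    """Return True if the requested time limit does not exceed MaxWall."""
    return wall_to_seconds(requested) <= wall_to_seconds(max_wall)


if __name__ == "__main__":
    # class_b is listed with MaxWall 2-00:00:00 in the table.
    print(fits_max_wall("1-12:00:00", "2-00:00:00"))  # request within the limit
    print(fits_max_wall("3-00:00:00", "2-00:00:00"))  # request above the limit
```

Since the limits can change over time, the values passed to such a check should come from the current batchlim output rather than from a copy of the table.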
User Account Limits
These limits are shown in the NODELIST(REASON) column of the squeue output while a submitted job is waiting in the queue. It should be emphasized that this is NOT an error: the job will start executing automatically as soon as the resources it requested become available. Some of the most common reasons and their explanations are the following:
Priority: As explained above, the SLURM system relies on partitions, QOS limits and priorities to set the order in which jobs are executed. When this reason appears in NODELIST(REASON), it is usually because there is a high demand for resources and it will take some time for them to be released.
Dependency: appears when the start of one job is contingent on the completion of another.
Resources: the system is waiting for the requested resources to be released before it can start running the job.
QOSMaxJobsPerUserLimit: indicates that the maximum number of jobs that a user can launch simultaneously has been reached.
QOSMaxCpuPerUserLimit: indicates that the maximum number of CPUs that a user can request and use simultaneously has been reached.
Apart from the limits of the user account, there are also limits for the group or association to which your account belongs. The following messages can also appear in NODELIST(REASON), and they are not errors either. The most common ones are:
AssocGrpCpuLimit: the maximum number of CPUs granted to the association or group is in use. The jobs will remain in the queue until resources are released. Keep in mind that if you request many nodes, or exclusive use of them, the waiting time can be longer.
AssocGrpJobsLimit: same as the previous case, except that the limit reached is the number of jobs.
AssocGrpGRES: similar to AssocGrpCpuLimit but in this case the limit refers to the GPUs.
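The reasons described in this section can be summarised in a small lookup table. The sketch below is illustrative only: the dictionary covers just the reasons listed here, the helper name explain_pending is hypothetical, and the input is assumed to be lines of the form "JOBID REASON", which squeue can print with a custom output format such as squeue -h -o "%i %r".

```python
from typing import Dict, List, Tuple

# Reasons described in this section; any other reason is reported as unknown.
PENDING_REASONS: Dict[str, str] = {
    "Priority": "other jobs are ahead in the queue",
    "Dependency": "waiting for another job to complete",
    "Resources": "waiting for the requested resources to be released",
    "QOSMaxJobsPerUserLimit": "per-user job limit of the QOS reached",
    "QOSMaxCpuPerUserLimit": "per-user CPU limit of the QOS reached",
    "AssocGrpCpuLimit": "CPU limit of the group or association reached",
    "AssocGrpJobsLimit": "job limit of the group or association reached",
    "AssocGrpGRES": "GPU limit of the group or association reached",
}


def explain_pending(lines: List[str]) -> List[Tuple[str, str]]:
    """Map 'JOBID REASON' lines to (job id, explanation) pairs."""
    result = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        job_id, reason = fields
        result.append((job_id, PENDING_REASONS.get(reason, "unknown reason")))
    return result


if __name__ == "__main__":
    # Hypothetical sample input; the job IDs are made up for illustration.
    sample = ["1001 Priority", "1002 AssocGrpCpuLimit", "1003 QOSMaxJobsPerUserLimit"]
    for job_id, text in explain_pending(sample):
        print(f"{job_id}: {text} (not an error)")
```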