AI nodes (GPU nodes)

To request the use of a GPU in a job, the option --gres=gpu must be specified. The following options can be useful:

--gres=<list>: Specifies a comma-delimited list of generic consumable resources. The format of each entry in the list is “name[[:type]:count]”. The name is that of the consumable resource. The count is the number of those resources, with a default value of 1. The specified resources will be allocated to the job on each node. The available generic consumable resources are configurable by the system administrator. A list of available generic consumable resources will be printed and the command will exit if the option argument is “help”. Examples of use include: --gres=gpu, --gres=gpu:2, and --gres=help.
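
In a batch script the same option can be given as an #SBATCH directive. The following sketch is only illustrative (the job name, CPU, memory and time values are placeholders, not site defaults):

#!/bin/bash
#SBATCH --job-name=gpu_job      # placeholder job name
#SBATCH --gres=gpu:2            # request 2 GPUs on the node
#SBATCH -c 64                   # placeholder CPU request (32 CPUs per GPU)
#SBATCH --mem=128G              # placeholder memory request
#SBATCH -t 00:20:00             # placeholder time limit

nvidia-smi                      # list the GPUs allocated to the job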

--gres-flags=enforce-binding: If set, the only CPUs available to the job will be those bound to the selected GRES (i.e. the CPUs identified in the gres.conf file will be strictly enforced rather than advisory). This option may result in delayed initiation of a job. For example, a job requiring two GPUs and one CPU will be delayed until both GPUs on a single socket are available, rather than using GPUs bound to separate sockets; however, application performance may improve due to improved communication speed. This requires the node to be configured with more than one socket, and resource filtering will be performed on a per-socket basis.
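
For instance, enforce-binding can be combined with a --gres request on the command line (the CPU, memory and time values below are only illustrative):

$ srun --gres=gpu:2 --gres-flags=enforce-binding -c 64 --mem=128G -t 20 nvidia-smi topo -m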

The following GPU models are available on FinisTerrae III:

NVIDIA A100

NVIDIA A100-PCIE-40GB

  • CUDA Driver Version / Runtime Version: 11.5 / 11.2

  • CUDA Capability Major/Minor version number: 8.0

  • Memory: 40 GB of HBM2, bandwidth 1555 GB/s

  • 108 Multiprocessors, 64 CUDA Cores/MP: 6912 CUDA Cores

  • GPU Max Clock rate: 1410 MHz (1.41 GHz)

The standard NVIDIA A100 nodes have 2 GPUs per node; you can request the use of 1 or 2 GPUs with the option --gres=gpu:N, where N is 1 or 2. There are also two new special nodes with more GPUs per node:

  • 5x NVIDIA A100: to use this node, set --gres=gpu:N where N is a value between 3 and 5.

  • 8x NVIDIA A100: to use this node, set --gres=gpu:N where N is a value between 6 and 8.

Use:

$ srun --gres=gpu:a100 -c 32 --mem=64G -t 20 nvidia-smi topo -m
$ srun --gres=gpu:a100:2 -c 64 --mem=128G -t 20 nvidia-smi topo -m

Warning

  • CPUs requested for the 2x NVIDIA A100 nodes must be 32 per GPU requested.

  • CPUs requested for the 5x NVIDIA A100 node must be 12 per GPU requested.

  • CPUs requested for the 8x NVIDIA A100 node must be 8 per GPU requested.

You can find some script examples here.
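
As a purely illustrative sketch (the job name, memory and time limit are placeholders), a batch script for the 5x NVIDIA A100 node that follows the 12 CPUs per GPU rule above could look like this:

#!/bin/bash
#SBATCH --job-name=a100_multi   # placeholder job name
#SBATCH --gres=gpu:a100:4       # N between 3 and 5 targets the 5x A100 node
#SBATCH -c 48                   # 12 CPUs per GPU requested (4 x 12)
#SBATCH --mem=64G               # placeholder memory request
#SBATCH -t 00:30:00             # placeholder time limit

nvidia-smi topo -m              # show the GPU/CPU topology of the allocation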

Tesla T4

  • CUDA Driver Version / Runtime Version: 11.5 / 11.2

  • CUDA Capability Major/Minor version number: 7.5

  • Memory: 16 GB of GDDR6, bandwidth 320 GB/s

  • 40 Multiprocessors, 64 CUDA Cores/MP: 2560 CUDA Cores

  • GPU Max Clock rate: 1590 MHz (1.59 GHz)

Interactive use:

$ compute --gpu
$ srun -p viz --gres=gpu:t4 --mem=8G -t 20 nvidia-smi topo -m
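
The same interactive allocation can be used to run your own program on the T4; in the example below, my_gpu_script.py and the time limit are placeholders:

$ srun -p viz --gres=gpu:t4 --mem=8G -t 60 python my_gpu_script.py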