In the User Portal > Information > Links you can find the FinisTerrae III workshops that will help you understand the system and its use. It’s strongly recommended to read them before connecting.
How can I connect to FinisTerrae III?
There are several ways to connect to the server and use it. From User Portal > Tools you can open an SSH Terminal or a Remote Desktop. Note that Remote Desktops have a maximum duration of 24 hours (unless the session is renewed) and a maximum memory of 32GB. Their advantage is that, thanks to the web interface, they are much easier to use than a command terminal.
The other ways to connect and use FinisTerrae III are interactive sessions with the compute command and, via SSH, the batch system.
More information about system use: System Use.
I cannot access any CESGA server.
If this happens only when you are outside your research center or university, and you can connect from there without any problem, the cause is the VPN. To use CESGA resources from outside the authorized centers you must connect through the VPN. The Checkpoint client is mandatory on every operating system; follow the steps explained in How to connect to install it and configure the connection.
What are the credentials used for authentication in Checkpoint?
These credentials are the same ones used to access FinisTerrae III or other services offered by CESGA. That is, it is the username that was granted when registering for CESGA services.
DO NOT ENTER THE EMAIL OR THE DOMAIN @FT3.CESGA.ES. Enter only the username, that is, everything that precedes the @ symbol. For example, if you use user_cesga@ft3.cesga.es to connect to FinisTerrae III, the username to enter in the Checkpoint credentials is user_cesga.
Use of FinisTerrae III
How can FinisTerrae III be used?
You can access FinisTerrae III through interactive sessions using the compute command. Note that the resources of this type of session are limited to 64 cores, 247GB of memory and a maximum of 8 hours.
Another option is the batch system, intended for all cases in which interactive sessions do not meet the needs of your simulations. It lets you get the most out of FinisTerrae III: multiple nodes and cores (for parallel jobs), large amounts of memory, longer run times and, if desired, GPU accelerators.
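As a rough sketch of what a batch job looks like, the snippet below writes a minimal job script and shows how it would be submitted. The file name job.sh and the requested values (10 minutes, 2GB, one task with one core) are illustrative assumptions, not recommended settings. The time and memory options are included because the batch system requires them.

```shell
#!/bin/bash
# Write a minimal job script to job.sh (hypothetical name; all values are illustrative).
cat > job.sh <<'EOF'
#!/bin/bash
# Time limit in D-HH:MM:SS (here 10 minutes)
#SBATCH -t 0-00:10:00
# Memory requested for the job
#SBATCH --mem=2G
# One task with one core
#SBATCH -n 1
#SBATCH -c 1

# Replace this line with the real workload.
srun hostname
EOF

# On a login node the script would then be submitted with:
#   sbatch job.sh
```

Adapt the directives to what the job really needs before submitting it.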
How to know which applications and modules are installed? How do I load a module?
To see the modules and applications installed on FinisTerrae III, we recommend the command module spider, which shows a list with a brief description of each module. To search for a specific module, use module spider <module_name>. If there are several versions of that module, they will be listed. For an extended description of a module, including how to use and load it, you have to specify the version.
To load a module, use module load <module_name>, adding the version if required.
With the equery command you can also see other installed modules. To list them all, run equery l "*". For example, to search for zlib, run equery l zlib.
If the module or application you want to use is not installed, you must contact the Applications department and make an installation request.
More information at: Environment modules.
Is it possible to connect to the nodes where I have a job running?
It is possible to ssh to a node where one of your jobs is running. The list of nodes on which a job is running can be obtained with the squeue command. For example, if the job is running on ilk-20, run ssh ilk-20. You will be asked for your password and, once authenticated, you will be connected to that node.
Note that connections are only allowed to nodes where you have a job running. Once the job finishes, you will be logged out of the node and returned to one of the login nodes.
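The lookup described above can be sketched as a small helper. The function name my_job_nodes is hypothetical; it only wraps standard squeue options (-h to drop the header, -u for the user, -t for the job state, -o for the output format, where %i is the job ID and %N the node list).

```shell
# Print "<jobid> <nodelist>" for each of the current user's running jobs.
my_job_nodes() {
    squeue -h -u "${USER:-$(id -un)}" -t RUNNING -o "%i %N"
}

# Example use on a login node:
#   my_job_nodes
#   ssh ilk-20      # only if ilk-20 appears in the list
```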
Why do I get errors or warnings when I try to launch a job?
There can be many reasons why errors or warnings may arise but the most common are:
It is mandatory to indicate the memory and the time that the job will need, with the parameters --mem or --mem-per-cpu and -t D-HH:MM:SS. If either of them is missing, the error messages will be as follows:
sbatch: error: Batch job submission failed: Time limit specification required, but not provided
sbatch: error: slurm_job_submit: Neither --mem nor --mem-per-cpu specified
Warnings associated with User Account Limits. They appear in the Nodelist(Reason) column when a job is launched. Note that they are NOT errors: as soon as the resources requested for the job are available, it will start running automatically. Some common reasons and their explanations are the following:
Priority: As explained above, the SLURM system relies on partitions, QOS limits and priorities to set the order in which jobs are executed. When this message appears in the Nodelist(Reason) it is usually because there is a high demand for resources and it will take some time for them to be released.
Dependency: appears when the start of one job is contingent on the completion of another.
Resources: the system is waiting for the requested resources to be released before it can start running the job.
QOSMaxJobsPerUserLimit: indicates that the maximum number of jobs that a user can launch simultaneously has been reached.
QOSMaxCpuPerUserLimit: indicates that the maximum number of CPUs that a user can use and request simultaneously has been reached.
AssocGrpCpuLimit: the maximum number of CPUs granted to a group or association to which the user belongs is being used.
AssocGrpJobsLimit: same explanation as in the previous case but referring to the job limit.
AssocGrpGRES: similar to the two previous cases but for the GPU limit.
Therefore, keep in mind that there are restrictions at the user level and also at the level of the group to which the user belongs. You can use the batchlim command to obtain the partitions and QOS limits for your account.
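To see at a glance which of the reasons listed above are holding your jobs, something like the following could be used. The function name pending_reasons is hypothetical; %r is the squeue format field for the reason a job is in its current state.

```shell
# Count the current user's pending jobs grouped by the reason SLURM reports.
pending_reasons() {
    squeue -h -u "${USER:-$(id -un)}" -t PENDING -o "%r" | sort | uniq -c | sort -rn
}

# Example use on a login node:
#   pending_reasons
```

Each output line shows a count followed by a reason, with the most frequent reason first.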
Why is my job not running? How much time will I have to wait? What can I do to reduce the waiting times?
The most common reasons why a job is still pending are:
AssociationJobLimit: The job’s association has reached its maximum job count
AssociationResourceLimit: The job’s association has reached some resource limit
AssociationTimeLimit: The job’s association has reached its time limit
BadConstraints: The job’s constraints can not be satisfied
Dependency: This job is waiting for a dependent job to complete
JobHeldAdmin: The job is held by a system administrator
JobHeldUser: The job is held by the user
Priority: One or more higher priority jobs exist for this partition
QOSResourceLimit: The job’s QOS has reached some resource limit. (QOSMaxJobsPerUserLimit)
Resources: The job is waiting for resources to become available
You can also use the command squeue --start -j <jobid> to see the estimated start time of your job. Sometimes the system cannot provide this information because it does not yet know when the resources will be available.
To reduce the time a job waits before starting, adjust the requested resources to what the job really needs. If you request many more cores, much more memory or much more time than necessary, you will wait much longer until that amount of resources is available.
How can I know how much memory my jobs are using? How can I adjust the memory request for job submission?
Sometimes it is difficult to determine how much memory a job will require. We recommend running tests with different amounts of memory. You can use the seff <jobid> command to check the memory actually used by your job and adjust your memory request accordingly. The output of this command is an approximation, so it should not be taken literally:
$ seff 2106200
Job ID: 2106200
Cluster: finisterrae3
User/Group:
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 1-22:39:55
CPU Efficiency: 99.64% of 1-22:50:01 core-walltime
Job Wall-clock time: 1-22:50:01
Memory Utilized: 129.04 MB
Memory Efficiency: 4.20% of 3.00 GB
In this example, the request of 3GB of RAM is far more than the 129.04 MB actually used, so it should be reduced.
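One possible way to turn the seff figure into a new request is sketched below. The function name suggest_mem and the 20% headroom are assumptions chosen for illustration, not a CESGA recommendation, and the parsing only handles values reported in KB, MB or GB.

```shell
# Read "Memory Utilized" from seff and print a --mem value with 20% headroom.
suggest_mem() {
    seff "$1" | awk '
        /^Memory Utilized:/ {
            value = $3; unit = $4
            # Convert the reported value to MB (KB, MB and GB only).
            if (unit == "GB") value *= 1024
            else if (unit == "KB") value /= 1024
            # Add 20% headroom (assumed margin) and round up to whole MB.
            mb = value * 1.2
            rounded = int(mb)
            if (rounded < mb) rounded++
            printf "--mem=%dM\n", rounded
        }'
}

# Example use:
#   suggest_mem 2106200
```

Since the seff output is approximate, treat the result as a starting point for further tests rather than a final value.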
I can’t list or load modules using “module” or “module spider”
We are having some issues listing and loading modules using spider. The error message is: “/usr/bin/lua: /usr/share/lmod/lmod/libexec/FrameStk.lua:125: attempt to index a boolean value (local ‘mname’)”.
To fix this, you can try running the command with the cache ignored: module --ignore-cache spider. If this does not resolve the issue, you may need to delete the .lmod.d directory located in your $HOME:
rm -rf "$HOME/.lmod.d"
How much storage space is being used?
Use the myquota command, which shows the usage of all your directories ($HOME, $STORE and $LUSTRE). If you are in the $LUSTRE directory, you can also use the lfs quota command. Note that if there is not enough space in the directory where the job results are being saved, the jobs will fail for lack of space or will not even start running.
Which are the differences between $HOME, $STORE and $LUSTRE directories? What is the function of each of them?
All users have access to these directories. They differ in storage space, and each has two types of limits: total size (in GB) and number of files. Both must be checked when a quota is exceeded.
The $HOME directory is intended for storing code files and has low access speed. Its limits are 10GB and 100,000 files. It is the only directory that is backed up.
The $STORE directory is similar to $HOME but has much more storage space, so its main use is to save the results of simulations or other work. Its limits are 500GB and 300,000 files, and its access speed is also low. There is no backup of this directory.
The $LUSTRE directory is the largest one, with 3TB of storage space and a limit of 200,000 files. It also has the fastest access speed. It is the directory intended for running simulations, and thanks to its large capacity it can also serve as storage if required. There is no backup of this directory.
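Because both limits apply, it can help to see which subdirectory is consuming size or files. The helper below is a minimal sketch using only du and find; the name dir_usage is hypothetical, and the way it counts files may differ from how the quota system counts them, so the site quota tools remain the reference.

```shell
# Print the total size and the number of regular files under a directory.
dir_usage() {
    dir="${1:-$HOME}"
    files=$(find "$dir" -type f | wc -l | tr -d ' ')
    size=$(du -sh "$dir" | cut -f1)
    echo "$dir: $size, $files files"
}

# Example use:
#   for d in "$HOME" "$STORE" "$LUSTRE"; do dir_usage "$d"; done
```

On very large directory trees this can take a while, since every file is visited.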
I can’t create any Remote Desktops from User Portal or command line
This problem is probably due to a quota limit. Check your $HOME quota: if you have exceeded the limits, or you are close to exceeding them, you will not be allowed to create the Remote Desktop. To avoid this problem, free some space in your $HOME directory by deleting data you do not need or moving it to your $STORE or $LUSTRE directories. For storage and quota information visit: Permanent Storage.