F.A.Q.

In the User Portal > Information > Links you can find the FinisTerrae III workshops that will help you understand the system and its use. It’s strongly recommended to read them before connecting.

Connection

Workaround for Windows SSH and VS Code Users

Some users who connect to our machines over SSH from a Windows command prompt, PowerShell, or Visual Studio Code’s SSH extension have received a new error message: “Corrupted MAC on input” or “message authentication code incorrect”. This error is caused by an outdated OpenSSL library included in Windows, combined with a recent security-mandated change to ssh on our machines. There is a functional workaround for this issue. (Note: If you are not experiencing the above error, you do not need and should not use the following workaround.)

VS Code SSH extension users need to create or modify the ssh config file on their local computer (~/.ssh/config), adding a host entry for our machines that specifies a new message authentication code. For example, for FinisTerrae III the content to add to this file could be something like this:

Host ft3.cesga.es
  HostName ft3.cesga.es
  MACs hmac-sha2-512

Here ~ means C:\Users\<your_local_username>, that is, the home folder of your user on your own computer. The configuration file also applies to command-line ssh in Windows.

For command-line and PowerShell ssh users, adding -m hmac-sha2-512 to the ssh command is another solution. For example: ssh -m hmac-sha2-512 <username>@ft3.cesga.es. If you are using the scp command, add -o MACs=hmac-sha2-512 instead.
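As a minimal sketch, assuming a POSIX shell on the local computer (for example Git Bash or WSL, which is an assumption, since PowerShell uses a different syntax), the host entry shown above could be appended to the config file only when it is not already present. The function name ensure_mac_entry is hypothetical.

```shell
# ensure_mac_entry is a hypothetical helper name, not a CESGA tool.
# It appends the FinisTerrae III host entry to an ssh config file once.
ensure_mac_entry() {
    config_file="$1"
    mkdir -p "$(dirname "$config_file")"
    touch "$config_file"
    # Do nothing if an entry for this host already exists.
    if grep -q '^Host ft3\.cesga\.es$' "$config_file"; then
        return 0
    fi
    # The leading blank line keeps the entry apart from earlier content.
    printf '\nHost ft3.cesga.es\n  HostName ft3.cesga.es\n  MACs hmac-sha2-512\n' >> "$config_file"
}

# Usage: ensure_mac_entry "$HOME/.ssh/config"
```

Running the function twice leaves a single entry, so it is safe to repeat.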

How can I connect to FinisTerrae III?

There are several ways to connect to the server and use it. From User Portal > Tools you can access an SSH terminal and a remote desktop. Note that remote desktops have a maximum duration of 24 hours (unless the session is renewed) and a maximum memory of 32GB. The advantage of remote desktops is that, thanks to their web interface, they are much easier to use than a command terminal.

The other ways to connect and use FinisTerrae III are through interactive sessions with the compute command and via SSH using the batch system.

More information about system use: System Use.

I cannot access any CESGA server.

If this happens when you are outside your research center or university, but you can connect without any problem from inside it, the cause is the VPN. To use CESGA resources from outside the authorized centers you must use the VPN. The Checkpoint client is mandatory on every operating system; follow the steps explained in How to connect to install and configure the connection.

What are the credentials used for authentication in Checkpoint?

These credentials are the same ones used to access FinisTerrae III or other services offered by CESGA. That is, it is the username that was granted when registering for CESGA services.

Warning

DO NOT ENTER THE EMAIL OR DOMAIN @FT3.CESGA.ES. Enter only the username, that is, everything that precedes the @ symbol. For example, if you use user_cesga@ft3.cesga.es to connect to FinisTerrae III, the username that should be entered in the Checkpoint credentials is user_cesga.
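The rule above can be illustrated with a short shell sketch that strips everything from the @ symbol onward. The function name vpn_username is hypothetical and only serves the example.

```shell
# vpn_username is a hypothetical helper name used only for illustration.
vpn_username() {
    # ${1%%@*} removes the longest suffix that starts at the first @.
    printf '%s\n' "${1%%@*}"
}

vpn_username "user_cesga@ft3.cesga.es"   # prints user_cesga
```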

Use of FinisTerrae III

How can FinisTerrae III be used?

You can access FinisTerrae III through interactive sessions using the compute command. Note that sessions of this type are limited to 64 cores, 247GB of memory per core and a maximum of 8 hours.

Another option is the batch system, intended for all cases in which interactive sessions do not meet the needs of your simulations. It lets you get the most out of FinisTerrae III: you can use multiple nodes and cores (to parallelize jobs), request large amounts of memory, run jobs for a longer time and use GPU accelerators if desired.
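A minimal sketch of a batch job script could look like the following. All the values (job name, time, memory, core count) are placeholders chosen for illustration, not recommended settings; the time and memory options are the ones this FAQ lists as mandatory. Such a script would be submitted with sbatch <script_name>.

```shell
#!/bin/bash
#SBATCH --job-name=example        # placeholder job name
#SBATCH --time=0-00:10:00         # D-HH:MM:SS, placeholder value
#SBATCH --mem-per-cpu=2G          # placeholder value
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4         # placeholder value

# Slurm sets SLURM_CPUS_PER_TASK inside the job when --cpus-per-task is
# given; fall back to 1 when the script is run outside the batch system.
threads="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with $threads thread(s)"
```

Outside Slurm the #SBATCH lines are plain comments, so the script can be tried locally before it is submitted.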

How to know which applications and modules are installed? How do I load a module?

To see which modules and applications are installed in FinisTerrae III, we recommend the command module spider, which shows a list and a brief description of the modules. To search for a specific module: module spider <module_name>. If several versions of that module exist, they will be listed. For an extended description of a module, including how to use and load it, you have to specify the version.

To load a module, use module load <module_name>, adding the version if required.

With the command equery you can also see other installed modules. To see them all, run equery l "*". For example, to search for zlib: equery l zlib.

If the module or application you want to use is not installed, you must contact the Applications department and make an installation request.

More information at: Environment modules.
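As a small sketch of the load step, the wrapper below loads a module with or without a version. It assumes the usual <module_name>/<version> form accepted by module load; load_module is a hypothetical name, and the module command itself is only available on the cluster.

```shell
# load_module is a hypothetical wrapper around the module command.
load_module() {
    name="$1"
    version="${2:-}"
    if [ -n "$version" ]; then
        # A specific version is requested as <module_name>/<version>.
        module load "$name/$version"
    else
        module load "$name"
    fi
}

# Typical sequence on the cluster (names are placeholders):
#   module spider <module_name>
#   module spider <module_name>/<version>
#   load_module <module_name> <version>
```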

Is it possible to connect to the nodes where I have a job running?

It is possible to ssh to a node where your job is running. The list of nodes on which the job is running can be obtained with the squeue command. For example, if the job is running on ilk-20, run ssh ilk-20. You will then be asked for your password and, once you are authenticated, you will be connected to that node. Note that connections are only allowed to nodes where you have a running job; once the job has finished, you will be disconnected and returned to one of the login nodes.
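The lookup can be sketched as follows, assuming the standard Slurm commands squeue and scontrol are available on the login node. The helper name node_of_job is hypothetical.

```shell
# node_of_job is a hypothetical helper name.
# It prints the first node allocated to a job.
node_of_job() {
    jobid="$1"
    # -h drops the header line; %N prints the job's node list.
    nodelist=$(squeue -h -j "$jobid" -o "%N")
    if [ -z "$nodelist" ]; then
        echo "job $jobid has no allocated nodes" >&2
        return 1
    fi
    # Multi-node jobs print a compressed list; expand it to one name per line.
    scontrol show hostnames "$nodelist" | head -n 1
}

# Usage on the cluster: ssh "$(node_of_job <jobid>)"
```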

Why do I get errors or warnings when I try to launch a job?

There can be many reasons why errors or warnings may arise but the most common are:

  1. It is mandatory to indicate the memory and the time that the job will need with the parameters --mem= or --mem-per-cpu= and --time=D-HH:MM:SS or -t D-HH:MM:SS. If either of these parameters is missing, the error messages will be as follows:

    • sbatch: error: Batch job submission failed: Time limit specification required, but not provided

    • sbatch: error: slurm_job_submit: Neither --mem nor --mem-per-cpu specified

  2. Warnings associated with User Account Limits. They appear in the Nodelist(Reason) column when a job is launched. Note that this is NOT an error: when the requested resources become available, the job will start running automatically. Some reasons that usually appear, and their explanations, are the following:

    • Priority: the SLURM system relies on partitions, QOS limits and priorities to set the order in which jobs are executed. When this message appears in Nodelist(Reason), it is usually because there is a high demand for resources and it will take some time for them to be released.

    • Dependency: appears when the start of one job is contingent on the completion of another.

    • Resources: the system is waiting for the requested resources to be released before it can start running the job.

    • QOSMaxJobsPerUserLimit: indicates that the maximum number of jobs that a user can launch simultaneously has been reached.

    • QOSMaxCpuPerUserLimit: indicates that the maximum number of CPUs that a user can use and request simultaneously has been reached.

    • AssocGrpCpuLimit: the maximum number of CPUs granted to a group or association to which the user belongs is being used.

    • AssocGrpJobsLimit: same explanation as in the previous case but referring to the job limit.

    • AssocGrpGRES: similar to the two previous cases but for the GPU limit.

Therefore, note that there are restrictions at the user level and also at the level of the group to which the user belongs. You can use the command batchlim to obtain the partitions and QOS limits for your account.
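To catch the first kind of error before submitting, a job script can be checked for the two mandatory options. The sketch below is a hypothetical helper (check_job_script is not a CESGA or Slurm command) and it only inspects #SBATCH lines, so options passed directly on the sbatch command line are not seen.

```shell
# check_job_script is a hypothetical helper; it only reads #SBATCH lines.
check_job_script() {
    script="$1"
    status=0
    # Time limit: --time=... or -t ...
    if ! grep -Eq '^#SBATCH.*[[:space:]](--time[=[:space:]]|-t[[:space:]0-9])' "$script"; then
        echo "missing time limit (--time or -t)"
        status=1
    fi
    # Memory: --mem=... or --mem-per-cpu=...
    if ! grep -Eq '^#SBATCH.*[[:space:]]--mem(-per-cpu)?[=[:space:]]' "$script"; then
        echo "missing memory request (--mem or --mem-per-cpu)"
        status=1
    fi
    return "$status"
}

# Usage: check_job_script <script_name> && sbatch <script_name>
```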

Why is my job not running? How much time will I have to wait? What can I do to reduce the waiting times?

The most common reasons why a job is still pending are:

  • AssociationJobLimit: The job’s association has reached its maximum job count

  • AssociationResourceLimit: The job’s association has reached some resource limit

  • AssociationTimeLimit: The job’s association has reached its time limit

  • BadConstraints: The job’s constraints can not be satisfied

  • Dependency: This job is waiting for a dependent job to complete

  • JobHeldAdmin: The job is held by a system administrator

  • JobHeldUser: The job is held by the user

  • Priority: One or more higher priority jobs exist for this partition

  • QOSResourceLimit: The job’s QOS has reached some resource limit. (QOSMaxJobsPerUserLimit)

  • Resources: The job is waiting for resources to become available

You can also use the command squeue --start <jobid> to see when your job is expected to start. Sometimes the system cannot provide this information because it does not yet know when the resources will be available.

To reduce the time a job waits before starting, adjust the requested resources to the job’s real requirements. If you request many more cores, much more memory or more time than the job really needs, you will wait considerably longer until those resources are available.
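As a sketch, the pending reason of a job can be read with the squeue format field %r and mapped to a short hint. The function names and the hint texts are hypothetical, and the mapping only summarizes the reasons listed in this FAQ.

```shell
# pending_reason and explain_reason are hypothetical helper names.
pending_reason() {
    # -h drops the header line; %r prints the reason a job is pending.
    squeue -h -j "$1" -o "%r"
}

explain_reason() {
    case "$1" in
        Priority)   echo "higher priority jobs are ahead in this partition" ;;
        Resources)  echo "waiting for the requested resources to become available" ;;
        Dependency) echo "waiting for another job to complete" ;;
        JobHeldUser|JobHeldAdmin)
                    echo "the job is held and will not start until it is released" ;;
        QOS*|Assoc*)
                    echo "a user, QOS or group limit has been reached; check batchlim" ;;
        *)          echo "reason not covered by this sketch: $1" ;;
    esac
}

# Usage on the cluster: explain_reason "$(pending_reason <jobid>)"
```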

How can I know how much memory my jobs are using? How can I adjust the memory request for job submission?

Sometimes it can be very difficult to determine how much memory a job will require. We recommend running tests with different amounts of memory. You can use the seff <jobid> command to check the memory actually used by your job and adjust your memory request accordingly. The output of this command is approximate, so treat it as an estimate rather than an exact figure:

$ seff 2106200
Job ID: 2106200
Cluster: finisterrae3
User/Group:
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 1-22:39:55
CPU Efficiency: 99.64% of 1-22:50:01 core-walltime
Job Wall-clock time: 1-22:50:01
Memory Utilized: 129.04 MB
Memory Efficiency: 4.20% of 3.00 GB

In this example, the job’s request of 3GB of RAM seems too high, so it should be reduced.
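One way to turn the seff output into a new request is sketched below: it reads the Memory Utilized line and adds some headroom. The function name suggest_mem is hypothetical, the default factor of 1.5 is an arbitrary assumption rather than a CESGA recommendation, and only the KB, MB and GB units are handled.

```shell
# suggest_mem is a hypothetical helper. It reads seff output on stdin and
# prints a memory amount in MB with headroom (default factor 1.5, an
# arbitrary assumption). Lines with units other than KB, MB, GB are skipped.
suggest_mem() {
    factor="${1:-1.5}"
    awk -v factor="$factor" '
        /^Memory Utilized:/ {
            value = $3
            unit = $4
            if (unit == "GB") value *= 1024
            else if (unit == "KB") value /= 1024
            else if (unit != "MB") next
            need = value * factor
            rounded = int(need)
            if (rounded < need) rounded++     # round up to a whole MB
            if (rounded < 1) rounded = 1      # never suggest 0
            printf "%dM\n", rounded
        }'
}

# Usage on the cluster: seff <jobid> | suggest_mem
```

For the output shown above (129.04 MB used), this prints 194M, which could then be passed to --mem.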

I can’t list or load modules using “module” or “module spider”

We are having some issues listing and loading modules using spider. The error message is: “/usr/bin/lua: /usr/share/lmod/lmod/libexec/FrameStk.lua:125: attempt to index a boolean value (local ‘mname’)”. To fix this, try running the command with the cache ignored: module --ignore-cache spider. If this does not resolve the issue, you may need to delete the .lmod.d directory located in your $HOME: rm -rf $HOME/.lmod.d

Storage

How much storage space is being used?

Use the myquota command, which shows the information for all the directories ($HOME, $STORE and $LUSTRE). If you are in the $LUSTRE directory, you can also use the lfs quota command. Note that if there is not enough space in the directory where the job results are being saved, the jobs will die due to lack of space or will not even start running.

What are the differences between the $HOME, $STORE and $LUSTRE directories? What is the function of each of them?

All users have access to these directories. They differ in storage space, and each has two types of limits: on size (GB) and on number of files. Either limit can cause a quota to be exceeded, so both must be taken into account.

  • The $HOME directory is intended for storing code files and has a low access speed. The limits are 10GB and/or 100,000 files. It is the only directory that is backed up.

  • The $STORE directory is similar to $HOME but has much more storage space, so its main use is to save the results of simulations or other work. It has a capacity of 500GB and/or 300,000 files, and its access speed is also low. There is no backup of this directory.

  • The $LUSTRE directory is the largest, with 3TB of storage space and a limit of 200,000 files. It also has the fastest access speed. It is the directory intended for running simulations, and thanks to its large capacity it can also serve as storage if required. There is no backup of this directory.
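Because the file-count limits are easy to overlook, a rough check can be sketched with find. The function name check_file_count is hypothetical, and it counts regular files only, which may differ from how the quota system counts; myquota remains the authoritative source.

```shell
# check_file_count is a hypothetical helper. It counts regular files only,
# which may not match the quota accounting exactly.
check_file_count() {
    dir="$1"
    limit="$2"
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    if [ "$count" -ge "$limit" ]; then
        echo "$dir: $count files, limit of $limit reached"
    else
        echo "$dir: $count of $limit files"
    fi
}

# Usage with the limits listed above:
#   check_file_count "$HOME" 100000
#   check_file_count "$STORE" 300000
#   check_file_count "$LUSTRE" 200000
```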

Remote Desktops

I can’t create any Remote Desktops from User Portal or command line

This problem is probably due to a quota limit. Check your $HOME quota: if you have exceeded the quota limits or are close to exceeding them, you won’t be allowed to create the Remote Desktop. To avoid this problem, free some space in your $HOME directory by deleting data you don’t need or moving it to your $STORE or $LUSTRE directories. For storage and quota information visit: Permanent Storage.
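To decide what to delete or move, it can help to list the largest entries in $HOME first. The following is a minimal sketch using find, du and sort; the function name largest_entries is hypothetical, and the -mindepth and -maxdepth options assume a GNU or BSD find.

```shell
# largest_entries is a hypothetical helper. It lists the biggest files and
# directories directly under a directory, hidden ones included.
largest_entries() {
    dir="$1"
    n="${2:-10}"
    # du -sk prints the size of each entry in KiB; sort largest first.
    find "$dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + | sort -rn | head -n "$n"
}

# Usage: largest_entries "$HOME" 10
```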