
GPU Server: Frequently Asked Questions (FAQ)

Pedro Louro edited this page Feb 7, 2023 · 1 revision

FAQ

Cannot connect to the server

A connection to the server can fail for several reasons, which depend mainly on where you are connecting from and how you are trying to connect.

Not at DEI

Most connection failures happen when you are away from the department. Here are some things to check:

  • Ensure that the DEI VPN is working correctly:
    • Check your credentials;
    • Check the certificate.
  • Check if the internet connection is stable;
  • If connecting through RDP fails, try SSH;
    • An RDP failure does not necessarily mean that the server is having trouble; the internet connection may simply be too weak to sustain an RDP session.
  • If all else fails, ask the other server users whether they are also having connection problems and proceed accordingly:
    • If no user can connect, ask for a hard reset of the server;
    • Otherwise, contact the system administrator to check that everything is okay with your user account.

If you have tried all of the above and it is at all possible, try connecting from DEI and, if necessary, follow the steps in the “The server is becoming slow or unresponsive” section.
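The RDP-versus-SSH check above can be roughly automated. The following is a minimal sketch, not an official tool for this server: the host name `gpu-server.example` is a placeholder, and the ports are the usual defaults (22 for SSH, 3389 for RDP), which may differ on the actual machine. It only tests whether each port accepts a TCP connection; it cannot tell whether the link is good enough to sustain an RDP session.

```python
import socket

# Placeholder host name (an assumption); replace with the real server address.
HOST = "gpu-server.example"
# Usual default ports; the server may be configured differently.
PORTS = {"SSH": 22, "RDP": 3389}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


def diagnose(results):
    """Map per-service reachability to a short hint based on the checklist."""
    ssh, rdp = results.get("SSH", False), results.get("RDP", False)
    if ssh and rdp:
        return "Both ports answer: check your credentials or your user status."
    if ssh:
        return "SSH answers but RDP does not: work over SSH for now."
    if rdp:
        return "RDP answers but SSH does not: ask the system administrator."
    return "Nothing answers: check VPN and internet, then ask the other users."


if __name__ == "__main__":
    results = {name: port_open(HOST, port) for name, port in PORTS.items()}
    for name, ok in results.items():
        print(f"{name}: {'reachable' if ok else 'unreachable'}")
    print(diagnose(results))
```

If nothing answers while you are away from DEI, the VPN is the first thing to re-check, since the server is presumably not reachable without it.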

At DEI

Most of the points from the previous section apply here, except those about the DEI VPN, since there is no need to connect to it when at the department. Check the following:

  • Make sure the internet connection is stable;
  • Try an SSH connection if RDP is unsuccessful;
  • Check your user status with the system administrator;
  • Follow the steps in the “The server is becoming slow or unresponsive” section.

In most cases, if you cannot connect to the server while at DEI, a hard reset is likely needed.

The server is becoming slow or unresponsive

When running an experiment on the server, there is always the possibility of something going wrong, due either to a mistake in the code or to colleagues not checking the available resources before running their experiments. This will happen often and, to avoid a hard reset of the server, there are some actions you can take. If the server seems to be slowing down or becoming unresponsive, do the following:

  • If the Remote Desktop session is not responding, immediately try connecting through SSH;
  • Check the running processes using your preferred process manager; btop is advised due to its filtering capabilities and is used for the rest of the example;
  • Using btop, sort the process list by RAM and CPU usage to find the culprit;
  • If the process is yours, kill it and investigate the unexpectedly high resource use;
  • If the process belongs to someone else, first notify that user in the appropriate channel (WELMO Server on Skype at the time of writing) and then kill it. In any case, the process should be killed as soon as possible, or the server will need a manual reset.

Sometimes no single process is overwhelming the server; there are simply too many people taking up resources. In this case, communicate through the proper channel to promote fair usage among users.
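When an interactive process manager is not practical, for example over a sluggish SSH session, a similar overview can be obtained from a script. The sketch below is one possible approach, assuming a Linux host with the procps `ps` command; the chosen columns and the function names are illustrative. It lists the heaviest processes and also sums usage per user, which can help when no single process is to blame.

```python
import os
import subprocess
from collections import defaultdict

# Assumes procps `ps` on Linux; the column selection is our own choice.
PS_CMD = ["ps", "-eo", "pid,user,pcpu,pmem,comm", "--no-headers"]


def parse_ps(text):
    """Parse `ps` output into dicts with pid, user, cpu, mem, and comm."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        pid, user, cpu, mem, comm = parts
        rows.append({"pid": int(pid), "user": user, "cpu": float(cpu),
                     "mem": float(mem), "comm": comm.strip()})
    return rows


def top_processes(rows, key="mem", n=5):
    """Return the n rows with the highest value for key ('mem' or 'cpu')."""
    return sorted(rows, key=lambda r: r[key], reverse=True)[:n]


def usage_by_user(rows):
    """Sum CPU and memory percentages per user."""
    totals = defaultdict(lambda: {"cpu": 0.0, "mem": 0.0})
    for r in rows:
        totals[r["user"]]["cpu"] += r["cpu"]
        totals[r["user"]]["mem"] += r["mem"]
    return dict(totals)


if __name__ == "__main__":
    # LC_ALL=C keeps the decimal separator as a dot regardless of locale.
    env = {**os.environ, "LC_ALL": "C"}
    out = subprocess.run(PS_CMD, capture_output=True, text=True,
                         check=True, env=env).stdout
    rows = parse_ps(out)
    print("Top processes by memory:")
    for r in top_processes(rows, "mem"):
        print(f"  {r['pid']:>7} {r['user']:<12} {r['mem']:5.1f}% {r['comm']}")
    print("Usage per user:")
    for user, t in sorted(usage_by_user(rows).items(),
                          key=lambda kv: kv[1]["mem"], reverse=True):
        print(f"  {user:<12} cpu={t['cpu']:.1f}% mem={t['mem']:.1f}%")
```

Passing `"cpu"` instead of `"mem"` to `top_processes` ranks by CPU usage. The script only reports; killing a process remains a manual decision, to be taken after notifying its owner.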

Known Problems

None.

Previously, one GPU (A5000) was not registering due to a hardware fault and was replaced (RMA). The output is kept here for future reference:

One of the GPUs is not registering (nvidia-bug-report.sh):

[   12.251296] kernel: NVRM: GPU 0000:3d:00.0: rm_init_adapter failed, device minor number 2
[   12.251941] kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00003d00] Failed to allocate NvKmsKapiDevice
[   12.252562] kernel: [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00003d00] Failed to register device

(...)

[    3.528019] pci 0000:5f:00.0: BAR 15: no space for [mem size 0x03800000 64bit pref]
[    3.528022] pci 0000:5f:00.0: BAR 15: failed to assign [mem size 0x03800000 64bit pref]
[    3.528026] pci 0000:60:03.0: BAR 15: no space for [mem size 0x03800000 64bit pref]
[    3.528029] pci 0000:60:03.0: BAR 15: failed to assign [mem size 0x03800000 64bit pref]
[    3.528032] pci 0000:60:03.0: BAR 14: assigned [mem 0xc5200000-0xc52fffff]
[    3.528036] pci 0000:61:00.0: BAR 0: no space for [mem size 0x01000000 64bit pref]
[    3.528039] pci 0000:61:00.0: BAR 0: failed to assign [mem size 0x01000000 64bit pref]
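Should a similar problem come up again, the kernel log can be scanned for the same signatures. The sketch below is an illustration only: its two patterns are taken directly from the excerpt above, other driver or kernel versions may word these messages differently, and it makes no claim about how the two kinds of message are related.

```python
import re
import sys

# Patterns copied from the log excerpt above (an assumption that future
# occurrences look the same).
PCI_ADDR = r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
NVRM_FAIL = re.compile(r"NVRM: GPU " + PCI_ADDR + r": rm_init_adapter failed")
BAR_FAIL = re.compile(r"pci " + PCI_ADDR + r": BAR (\d+): failed to assign")


def scan_kernel_log(text):
    """Collect PCI addresses of GPUs that failed to initialise and of
    devices with BARs that could not be assigned."""
    gpus = set()
    bars = {}
    for line in text.splitlines():
        m = NVRM_FAIL.search(line)
        if m:
            gpus.add(m.group(1))
            continue
        m = BAR_FAIL.search(line)
        if m:
            bars.setdefault(m.group(1), set()).add(int(m.group(2)))
    return {
        "gpu_init_failed": sorted(gpus),
        "bar_assign_failed": {addr: sorted(n) for addr, n in bars.items()},
    }


if __name__ == "__main__":
    # Reads the log from standard input.
    report = scan_kernel_log(sys.stdin.read())
    for addr in report["gpu_init_failed"]:
        print(f"GPU at {addr} failed to initialise")
    for addr, numbers in report["bar_assign_failed"].items():
        print(f"Device {addr}: could not assign BAR(s) {numbers}")
```

For example, saved under a hypothetical name such as `scan_gpu_log.py`, it could be fed the kernel log with `dmesg | python3 scan_gpu_log.py` (reading `dmesg` may require elevated privileges on some systems).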