Home
Pedro Louro edited this page May 19, 2023 · 7 revisions
Thanks to FCT and other funding sources, our research group has access to a high-performance GPU/Deep Learning server for running complex computations. Researchers may use the server, but please note that it is shared between users (MIR/MER and Health Informatics), so follow the netiquette rules defined here:
- Do not hoard resources for your tasks (GPUs / Memory / CPU cores)
- General rule is to use 1 GPU per user and a reasonable number of CPU cores
- If you need more, at least warn your colleagues (Skype group) and try to give an estimate (how much and for how long?)
- Design / plan and test your experiments with smaller subsets of data to confirm it works before going all in for weeks
- Not cool to hoard resources for long without at least getting results
- Do not reboot the machine without asking, especially if there are tasks running
- Similarly, be careful when changing software or killing processes
- If editing files remotely (e.g., using remote desktop), make sure you save your work before leaving.
- For long experiments use checkpoints (save/load progress) and logs. This saves time in case your process goes down. Random TensorFlow/Keras example here.
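The checkpoint-and-log advice above can be sketched without any deep learning framework. This is a minimal illustration using only the Python standard library, not the TensorFlow/Keras example referred to above. The function names (`save_checkpoint`, `load_checkpoint`, `run_experiment`), the JSON state layout, and the `save_every` parameter are all hypothetical, and the "work" done per step is a placeholder sum.

```python
import json
import logging
import os
import tempfile

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the target.

    os.replace is atomic, so a crash mid-write cannot leave a half-written
    checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"next_step": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def run_experiment(path, n_steps, save_every=10):
    """Run (or resume) a toy experiment, checkpointing every `save_every` steps."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        state["total"] += step  # placeholder for the real work (assumption)
        state["next_step"] = step + 1
        if state["next_step"] % save_every == 0:
            save_checkpoint(path, state)
            logging.info("checkpoint saved at step %d", state["next_step"])
    save_checkpoint(path, state)  # always persist the final state
    return state
```

Calling `run_experiment("experiment.json", 1000)` a second time after an interruption resumes from the last saved step instead of starting over. In a real training script the state would hold model weights and optimizer state, saved with the framework's own checkpoint mechanism.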
SuperServer 4029GP-TRT2 - 4U Dual Processor (Intel), Single-Root GPU System with Up to 8 PCI-E GPUs. Currently with:
- CPU: 2x Intel® Xeon® Silver 4214 Processor (`lscpu`)
  - Frequency: 2.20 GHz (Turbo 3.20 GHz)
  - Total Cores: 24 (12 per CPU)
  - Total Threads: 48 (24 per CPU)
  - Cache: 16.5 MB (per CPU)
- RAM: 320 GB DDR4 (`sudo lshw -short -C memory`)
  - 8 x 32GiB DIMM DDR4 Synchronous 2666 MHz (0.4 ns)
  - 2 x 32GiB DIMM DDR4 Synchronous 3200 MHz (0.3 ns)
- GPUs: 8 installed (`lspci | grep -i --color 'vga\|3d\|2d'` and `nvidia-smi`)
  - 5 x NVIDIA RTX A5000 24GB GDDR6
    - 8192 CUDA Cores ; 24GB GDDR6 ; 27.8 TFLOPS
  - 3 x NVIDIA Quadro P5000 16GB GDDR5X
    - 2560 CUDA Cores ; 16 GB GDDR5X ; 8.9 TFLOPS
- HDD: 20+TB (`lsblk -f`, `sudo hwinfo --disk --short`, `cat /proc/mdstat`)
  - 2x Intel SSD DC S4500 Series 480GB 2.5in SATA 6Gb/s
    - RAID 1, mounted at / (OS)
  - 6x Seagate 4TB Barracuda 2.5in 5400rpm SATA III 128MB (ST4000LM024)
    - RAID 5, mounted at /home (user files)
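Since the server mixes two GPU models and the general rule is one GPU per user, it can help to check which GPU is least busy before starting a job. The sketch below is one possible approach, not an official procedure for this server. It parses the CSV output of `nvidia-smi` and restricts the process to a single device via `CUDA_VISIBLE_DEVICES`. The function names are hypothetical, and "least used" is assumed to mean the lowest fraction of memory in use.

```python
import os
import subprocess

# Query flags supported by nvidia-smi; output is one "index, used, total" line per GPU (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn nvidia-smi CSV text into a list of (index, used_mib, total_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        gpus.append((int(index), int(used), int(total)))
    return gpus


def pick_least_used(gpus):
    """Return the index of the GPU with the smallest fraction of memory in use."""
    return min(gpus, key=lambda gpu: gpu[1] / gpu[2])[0]


def select_gpu():
    """Pick the least-used GPU and expose only that one to this process."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    index = pick_least_used(parse_gpu_memory(output))
    # Make CUDA number devices in PCI bus order, matching nvidia-smi indices.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return index
```

`select_gpu()` should be called before importing TensorFlow, PyTorch, or any other CUDA-based library, because the environment variables are only read when CUDA initialises. Memory usage is a rough proxy: a GPU can be busy while holding little memory, so checking `nvidia-smi` by hand and asking colleagues is still worthwhile.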
Several packages are already installed. Open question: should we install software system-wide for all users, or should each user install what they need?
- NVIDIA-SMI 520.61.05 / Driver Version: 520.61.05 / CUDA Version: 11.8
- MATLAB R2021b Update 1 (9.11.0.1809720) 64-bit (glnxa64)
  - Will probably ask you for a personal license; check MATLAB activation @ helpdesk
  - Can also be used via SSH just to run scripts (explain how here?)
  - MATLAB R2014b is also installed (`ll /usr/local/bin/matlab_r2014b`)
- Python 3.8