⚠️ 🔧 Jean-Zay planned February 5 partial stoppage and possible work-arounds 🔧 ⚠️ #97

Open
lesteve opened this issue Jan 15, 2024 · 0 comments

lesteve commented Jan 15, 2024

You have probably received an email on January 11 saying that a significant part of Jean-Zay is going to be shut down on February 5, 2024, in order to install new GPU nodes.

For now, here is a summary of my understanding of the situation. I'll update it when I get the time. Feel free to comment below (or edit if you have the rights).

Summary of the situation

  • 45% of the V100 GPUs are going to be stopped on February 5 and will be removed from the cluster to make room for the new H100 nodes
  • timing is not clear, but the hope is that the new H100 nodes may be available to users at the beginning of September. Take this with a huge grain of salt: it comes entirely from informal feedback, and IDRIS, which operates Jean-Zay, has not communicated on this

Adastra work-around

If you feel motivated/proactive/curious/bored, one possible work-around would be to apply to Adastra. I created a pad so that anyone can add relevant information there. At the time of writing (mid-January), it is pretty much anyone's guess whether this will be worth it for your particular use case ...

Here are a few things to know about Adastra:

  • Adastra has AMD GPUs (not NVIDIA), but I suspect most people around us don't care since they use PyTorch or similar frameworks that work on AMD GPUs. I am trying to gather feedback from users who have used AMD GPUs. If you have some, comment in this issue!
  • the simplest route is to apply for dynamic access ("accès dynamique"). The procedure is likely similar to the Jean-Zay one but may have some small differences. In principle you can get access in a few days. In practice we will see ...
  • once you manage to get access, you need to set things up again (datasets, conda environments, ssh set-up, etc ...). If you use modules (module load), Adastra likely has some, but they are probably slightly different from the Jean-Zay ones
  • Adastra has had lingering software and hardware issues since it started, but maybe this is more specific to HPC than to AI use cases? This partly explains why Adastra is currently underused and why it is recommended as a fall-back option.
  • the Adastra support team is already quite busy working hard to fix these issues. It is an open question how they will handle the load if plenty of users try to migrate to Adastra in a short amount of time
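
If you do get access and rebuild your environment, a quick sanity check is to ask PyTorch which GPU backend it was built for. The snippet below is only a minimal sketch, not something tested on Adastra: it assumes a ROCm build of PyTorch is what you end up installing there, and the helper names (`classify_build`, `describe_gpu_backend`) are made up for illustration. It relies on `torch.version.hip` being set on ROCm builds and on ROCm builds reusing the `torch.cuda` API.

```python
import importlib.util


def classify_build(hip_version, cuda_version):
    """Name the GPU backend from the version strings exposed in torch.version."""
    if hip_version:
        return f"ROCm/HIP {hip_version}"  # AMD GPUs
    if cuda_version:
        return f"CUDA {cuda_version}"  # NVIDIA GPUs
    return "CPU-only"


def describe_gpu_backend():
    """Report which backend the installed PyTorch targets and whether a GPU is visible."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    # torch.version.hip is a string on ROCm builds and None otherwise;
    # torch.version.cuda is the equivalent for CUDA builds.
    backend = classify_build(
        getattr(torch.version, "hip", None),
        getattr(torch.version, "cuda", None),
    )
    # ROCm builds reuse the torch.cuda namespace, so this call also applies to AMD GPUs.
    return f"{backend}, GPU visible: {torch.cuda.is_available()}"


if __name__ == "__main__":
    print(describe_gpu_backend())
```

Run it on a compute node rather than a login node if you want the "GPU visible" part to be meaningful, since login nodes may not have GPUs.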
lesteve pinned this issue on Jan 17, 2024
lesteve changed the title from "Jean-Zay planned February 5 partial stoppage and possible work-arounds" to "⚠️ 🔧 Jean-Zay planned February 5 partial stoppage and possible work-arounds 🔧 ⚠️" on Jan 17, 2024