LMOD: Error while loading shared libraries: libpython3.10.so.1.0: cannot open shared object file: No such file or directory #2731
-
On our cluster, we use Lmod to manage software modules. So when starting a new job, we need to load existing modules, e.g., to load PyTorch, Python, and pdsh:
So if I start a multi-node, interactive job to test deepspeed, I have these activated on my main node. When I then run some deepspeed code, I get an error that Python cannot be found.
I believe the reason is that pdsh ssh's into the nodes but, of course, does not automatically load the correct modules, so Python is not available on the nodes after ssh. I am not sure how to solve that. Is there a file of commands that I can give to deepspeed/pdsh to execute right after ssh'ing?
Replies: 2 comments 4 replies
-
This is a tricky one. Can you try adding these commands to your `.bashrc`?
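A minimal sketch of what such a `.bashrc` addition could look like. The module names are placeholders (the actual list from the question is not reproduced here), the function name `load_job_modules` is hypothetical, and the Lmod init script path is an assumption that varies between sites.

```shell
# ~/.bashrc fragment (sketch). Module names are placeholders, not the
# actual module list used on this cluster.
load_job_modules() {
    # Shells started over ssh may not have the `module` function defined yet.
    if ! command -v module >/dev/null 2>&1; then
        # Assumed location of the Lmod init script; check your site's setup.
        if [ -r /etc/profile.d/lmod.sh ]; then
            . /etc/profile.d/lmod.sh
        fi
    fi
    if command -v module >/dev/null 2>&1; then
        module load Python PyTorch pdsh
    else
        echo "module command not available; skipping module load" >&2
        return 1
    fi
}

# Never let a failed module load abort the shell startup.
load_job_modules || true
```

If your `.bashrc` starts with a guard that returns early for non-interactive shells, the fragment probably needs to sit above that guard, since the shells pdsh opens are typically non-interactive.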
-
Glad to hear of the progress. Yes, I understand the downside of the current approach. I don't think there is any environment variable set on login by pdsh/deepspeed. However, you could test for this by dumping the env vars in the very first line of your Another, somewhat heavyweight, option is to create a
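Dumping the environment at shell startup could be sketched as follows. The function name `dump_env` and the `ENV_DUMP_DIR` variable are hypothetical, chosen only for illustration; the idea is to write one file per host and per shell so that a normal login can be compared with a login made by the launcher.

```shell
# Sketch: record the environment of each new shell in its own file.
# ENV_DUMP_DIR is a hypothetical override; the default directory is arbitrary.
dump_env() {
    dump_dir="${ENV_DUMP_DIR:-$HOME/env_dumps}"
    mkdir -p "$dump_dir" || return 1
    # One file per host and per shell process, sorted for easy diffing.
    env | sort > "$dump_dir/env.$(hostname).$$.txt"
}

# Place this call as early as possible in the startup file being tested.
dump_env || true
```

Comparing two dumps with `diff` (one from a regular login, one from a launcher-initiated login) would show whether any variable distinguishes the two cases. Remove the call once the comparison is done, since it writes a file for every new shell.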