Turn off xpmem in OFED 5.8 on Chrysalis #6359
@amametjanov Can you check that this doesn't slow runs down? It did not on my test with an ne30 production coupled case.
Checked with 2 tests in create_test e3sm_prod_bench:
PFS.ne30pg2_r05_IcoswISC30E3r5.F2010.chrysalis_intel.bench-noio:
2024-04-18 22:53:17: MEMCOMP: Memory usage highwater changed by -0.16%: baseline=6373.210 MB, tolerance=5%, current=6362.930 MB
---------------------------------------------------
2024-04-18 22:53:17: TPUTCOMP: Throughput changed by 0.50%: baseline=1.791 sypd, tolerance=5%, current=1.782 sypd
PFS.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.bench-noio:
2024-04-18 22:52:00: MEMCOMP: Memory usage highwater changed by -0.45%: baseline=4902.090 MB, tolerance=5%, current=4879.850 MB
---------------------------------------------------
2024-04-18 22:52:00: TPUTCOMP: Throughput changed by 0.90%: baseline=1.997 sypd, tolerance=5%, current=1.979 sypd
Throughput dropped by <1% and memory highwater fell by <1% in both tests, well within the 5% tolerance.
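For anyone re-checking these numbers, here is a minimal sketch of the arithmetic behind the MEMCOMP and TPUTCOMP lines. The sign conventions are inferred from the log output above (a positive value means memory went up or throughput went down); the function names and the `results` table are hypothetical and are not taken from the test framework's code.

```python
# Sketch only: reproduces the percentages in the log lines above.
# Sign conventions are inferred from the logs, not from the framework source.

TOLERANCE = 0.05  # the 5% tolerance reported in the logs


def memory_change(baseline_mb, current_mb):
    """Relative change in memory highwater; positive means more memory used."""
    return (current_mb - baseline_mb) / baseline_mb


def throughput_change(baseline_sypd, current_sypd):
    """Relative change in throughput; positive means the run got slower."""
    return (baseline_sypd - current_sypd) / baseline_sypd


# Values copied from the two bench-noio tests above:
# (mem baseline, mem current, tput baseline, tput current)
results = {
    "F2010": (6373.210, 6362.930, 1.791, 1.782),
    "WCYCL1850": (4902.090, 4879.850, 1.997, 1.979),
}

for name, (mem_base, mem_cur, tput_base, tput_cur) in results.items():
    mem = memory_change(mem_base, mem_cur)
    tput = throughput_change(tput_base, tput_cur)
    within = mem <= TOLERANCE and tput <= TOLERANCE
    print(f"{name}: memory {mem:+.2%}, throughput slowdown {tput:+.2%}, "
          f"within tolerance: {within}")
```

Both tests come out under 1% on each metric, which matches the reported percentages.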
Turn off xpmem in OpenMPI
Add env var to turn off xpmem when using OpenMPI. Avoids leaving nodes in unkillable state.
b757987 to 71db923
@amametjanov Please try again with this new version, which doesn't have the typo.
PFS.ne30pg2_r05_IcoswISC30E3r5.F2010.chrysalis_intel.bench-noio:
PFS.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.bench-noio:
This is now in the openmpi module by default, so we don't need to add it.
Removed it from the module. It was in place from 2pm to 10pm on April 22.
Add env var to chrysalis to turn off xpmem when using OpenMPI. Avoids leaving nodes in unkillable state. Workaround for bug in xpmem. [BFB]
Revised the title and comment because this variable is needed all the time, not just with OpenMPI.
Add env var to chrysalis to turn off xpmem when using the new OFED 5.8 network drivers.
A bug in xpmem can leave nodes stuck in an unkillable state after a model crash.
[BFB]