Bug in distribution of processors at ECMWF atos #1212
Hi @tsemmler05,

If I am understanding Paul's message correctly, what he has done is to disable all the taskset calls. If you are okay with doing essentially what he did, you can simply set everything up the same way. If it does work, then we can try to see what the problem with the taskset approach is.

Can you share with us a path to those files edited by Paul, and also to one of the really slow runs/simulations?

@JanStreffing, can you please have a look at the message by Paul Dando, to see if you understand it the same way I do?
I can confirm that Paul's method only works for OpenMP threads = 1. Regarding sharing the superslow simulation, I have to regenerate it because of the ECMWF policy to remove data from /scratch after one month.
Actually, without carrying out the super slow simulation, I generated two directories to compare:

/scratch/duts/runtime/awicm3-v3.1/restlev960/run_21010101-21010103/scripts
/scratch/duts/runtime/awicm3-v3.1/restsuperslow/run_21010101-21010103/scripts

The first one is using Paul Dando's method, the second one the original method with taskset. You can compare the run scripts in the given directories as well as the prog_*.sh scripts in the corresponding work directories.
The question is whether one can fix the taskset problem, in case it proves difficult to use hetjob on ECMWF Atos.
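For context, the taskset method in the generated prog_*.sh scripts boils down to pinning each MPI rank to its own block of cores before starting the component. A rough, hypothetical sketch of the idea (illustration only, with made-up variable names; not the exact script ESM-Tools writes):

#!/bin/bash
# Rank of this task, as provided by the MPI process management interface
(( init = $PMI_RANK ))
# Cores to give each rank (hypothetical; defaults to 1 if OMP_NUM_THREADS is unset)
cores_per_task=${OMP_NUM_THREADS:-1}
# Pin this rank to a contiguous block of cores, then start the component
(( start_core = init * cores_per_task ))
(( end_core = start_core + cores_per_task - 1 ))
taskset -c ${start_core}-${end_core} ./script_fesom.sh

The point of the bug Paul spotted is that when the rank variable is empty or always the same, every rank computes the same core range and the whole model ends up on a single core.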
This is the reply from Paul Dando (trying to fix the taskset problem). He points to the heterogeneous-job syntax described in HPC2020: Submitting a parallel job:

srun -N1 -n 64 -c 2 executable1 : -N2 -n 64 -c 4 executable2
Hi @tsemmler05,

If this works, I can very easily implement it in ESM-Tools.

This, however, might be a clue about why it is not working with the taskset approach. I will compare what you have in your prog_*.sh scripts with what we generate elsewhere. So two things are going on in parallel: Paul is trying to give a better solution that could be easily implemented in ESM-Tools, and I am checking if there is something weird in our prog_*.sh scripts on ecmwf-atos in comparison with, for example, levante.
O.k., regarding the prog_*.sh scripts you can check in /scratch/duts/runtime/awicm3-v3.1/restsuperslow/run_21010101-21010103/work/; regarding the srun command, I guess I could try to change it manually and see if I get good performance; I guess this would be without hetjob and without taskset, if I understand this correctly.
I was playing around, echoing the rank variables used in these lines of the prog_*.sh scripts:

if [ -z ${PMI_RANK+x} ]; then PMI_RANK=$PMIX_RANK; fi
(( init = $PMI_RANK ))

This is what I did to check this out:

salloc --ntasks 64

Create a file deleteme.sh containing:

#!/bin/bash
echo $PMIX_RANK

Then I changed the permissions of that file to be executable and ran it with srun:

$ srun -n 5 deleteme.sh
0
1
2
4
3
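As a side note, a quick way to see all three candidate rank variables at once would be something like the following one-liner (a hypothetical debugging helper, run inside the same salloc allocation):

srun -n 5 bash -c 'echo "PMI_RANK=${PMI_RANK:-unset} PMIX_RANK=${PMIX_RANK:-unset} SLURM_PROCID=${SLURM_PROCID:-unset}"'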
Hi! I tried out the following run command:

time srun -N45 -n 5760 -c 1 fesom : -N10 -n 640 -c 2 oifs : -N1 -n 1 -c 128 rnfma : -N1 -n 4 -c 32 xios.x 2>&1 &

But I'm getting the following seg fault:

[ac5-1023:1933764:0:1934213] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x145bc20fd520)

This is the same error that I got before when trying

time srun -l --kill-on-bad-exit=1 --cpu_bind=cores --distribution=cyclic:cyclic --multi-prog hostfile_srun 2>&1 &

(compare directories /scratch/duts/runtime/awicm3-frontiers-xios/omp4_1_1/run_20000101-20000101/scripts and /scratch/duts/runtime/awicm3-v3.1/restlev640omp2/run_21010101-21010103/scripts)

So maybe there is a general problem at ECMWF when trying to run OpenIFS with a number of OMP threads not equal to 1?
So, does that mean that I could use the prog_*.sh files as they are generated by esm_tools and that I should only change the srun command (because without changing that I am getting the super slow simulation)?
Paul gave me this suggestion; I am not sure about his question regarding MPI_COMM_WORLD:

Hi again Tido,

Sorry - our emails crossed! So the "heterogeneous job" option doesn't work either.

Do your executables share the MPI_COMM_WORLD or does each run with a separate MPI_COMM_WORLD?

Maybe you can try to go back to using taskset but with SLURM_PROCID instead of PMI_RANK?

I did also try to use mpirun instead of srun as I think PMI_RANK should be set by mpirun. However, I've not yet been able to get this to work. I'll also try to investigate if there are other options you can try.

Best regards
Paul
But how would I use SLURM_PROCID instead of PMI_RANK? In the prog_*.sh scripts PMI_RANK is used as a variable, and I don't know where in the esm scripts this is coming from.
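For illustration, if one wanted to try Paul's SLURM_PROCID idea by hand, the rank-detection lines at the top of a prog_*.sh script could be extended with one more fallback. A minimal sketch, assuming the scripts keep their current structure; the SLURM_PROCID line is the only addition and is not something ESM-Tools currently generates:

#!/bin/sh
# Use PMI_RANK if the launcher sets it, otherwise fall back to PMIX_RANK ...
if [ -z ${PMI_RANK+x} ]; then PMI_RANK=$PMIX_RANK; fi
# ... and, as a last resort, to Slurm's own task rank (the manual addition)
if [ -z "${PMI_RANK}" ]; then PMI_RANK=$SLURM_PROCID; fi
(( init = $PMI_RANK ))
# ... rest of the generated script (taskset call etc.) unchanged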
To me it means that in principle it should work, but maybe I'm not getting Paul's point.
I'm pretty sure the answer is yes.

It should work with srun because the line is not only about PMI_RANK; there is an if that falls back to PMIX_RANK.

For now you can do this manually; if it works, then I can implement it in ESM-Tools. One last question @tsemmler05: do you have an example with taskset combined with the cyclic distribution and cpu_bind=cores?
Just to confirm, all executables share one MPI_COMM_WORLD.
I can make such an example - at this stage I only used taskset without cyclic distribution and with cpu_bind=none. I'll let you know the outcome.
This results again in the super slow simulation - which makes sense because of the bug in assigning the tasks properly that Paul Dando had spotted. Excerpt from the log file:

1019: /lus/h2resw01/scratch/duts/runtime/awicm3-v3.1/restlevtaskcyclic/run_21010101-21010103/work/./prog_fesom.sh: line 3: ((: init = : syntax error: operand expected (error token is "= ")
We seem to have made some headway. We found that with the default process management interface for srun on ECMWF Atos, PMI_RANK is not set inside the prog_*.sh scripts. With --mpi=pmi2 it is, so the scripts get a valid rank again.
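If useful, the effect of that flag can be checked in isolation with a quick test analogous to the earlier deleteme.sh experiment (hypothetical one-liner):

srun --mpi=pmi2 -n 5 bash -c 'echo "PMI_RANK=${PMI_RANK:-unset}"'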
Interesting. Let me know how it goes and if I need to make any changes in the ESM-Tools source code. I'm still blocking Friday (from 11:30 onwards) in case I need to work on ESM-Tools for this topic.
This change is easy to implement into esm_tools: one just has to put

launcher_flags: "--mpi=pmi2 -l --kill-on-bad-exit=1 --cpu_bind=cores --distribution=cyclic:cyclic"

into ecmwf-atos.yaml. However, running the model with taskset takes twice as long as without taskset (when the number of OMP threads is set to 1; at this stage setting it to something greater than 1 results in a seg fault). Therefore, I am considering abandoning the taskset method for ecmwf-atos, especially since I can't use OMP threads greater than 1 in OpenIFS anyway - which was the whole motivation for trying to get taskset running. In that case, if Paul Dando doesn't come up with any solution to these issues, I might need some help to take out the parts for ECMWF atos in prog_*.sh that do the processor redistribution, and also to take out the corresponding parts from the generated run script.

@mandresm: thanks for blocking Friday from 11:30 for getting this to work.
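For reference, with those launcher_flags the launch line for the compute job should then look roughly like this (assuming the rest of the command stays as in Paul's earlier version):

time srun --mpi=pmi2 -l --kill-on-bad-exit=1 --cpu_bind=cores --distribution=cyclic:cyclic --multi-prog hostfile_srun 2>&1 &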
Okay, then I'll contact you via Webex at 11:30 (German time) on Friday to catch up on the status.
A bug in the ESM-Tools-derived distribution of processors at ECMWF Atos has been detected by Paul Dando from ECMWF, after I reported extremely slow execution times of AWI-CM 3.1 on the ECMWF Atos machine.
I am running esm-tools version 6.37.2.
Paul Dando asked me to let the machine do the distribution of the processors by doing the following (citation of Paul Dando's support message), and this has led to an execution time comparable to DKRZ Levante:
Hi Tido,
I hope you had a good break.
My suspicions were also that there was a restart file in the work directory that was being read when I re-ran and which caused the floating point exception. From what I can see from the output, it ran successfully the first time but then failed when I retried.
The main changes I needed to make were to the prog_fesom.sh, prog_oifs.sh, etc scripts in the work directory. I think there's a bug in these which means the taskset command is trying to run all threads on a single core. In fact, I don't think you need the taskset at all so I changed these prog_*.sh scripts so that they just call, e.g., script_fesom.sh, directly. For example, for prog_fesom.sh, I have:
#!/bin/sh
./script_fesom.sh
and similarly for the other prog_*.sh scripts for oifs, xios and rnfmap. Alternatively, I suppose you could also just change the hostfile_srun file to call the ./script_fesom.sh, etc, scripts directly.
I also removed the part that creates the hostlist file in incoredtim_compute_20000101-20000103.run as you shouldn't need this either. Finally, I changed the srun command to:
time srun -l --kill-on-bad-exit=1 --cpu_bind=cores --distribution=cyclic:cyclic --multi-prog hostfile_srun 2>&1 &
I was then going to play a little more with the cpu-bind and distribution options to see if I could find a better combination.
With this setup, I think you should also be able to set OMP_NUM_THREADS for the OpenIFS executable (although I didn't try this).
Please let me know if this setup also works for you.
Best regards
Paul
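As background on the --multi-prog mechanism used above: hostfile_srun is a Slurm multi-prog configuration file that maps task ranks to programs. Its content would look roughly like the lines below; the rank ranges and script names here are illustrative only (based on the task counts mentioned earlier in the thread), the real file is generated by ESM-Tools:

0-5759 ./prog_fesom.sh
5760-6399 ./prog_oifs.sh
6400 ./prog_rnfmap.sh
6401-6404 ./prog_xios.sh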