I've started doing a little bit of performance analysis of Cholla on Frontier. I figure that's something most of us will be doing at some point soon, so I wanted to collect our thoughts, experience, guides, etc. here so that we can all share what we learn and how we learn it. I think eventually this should become a wiki page, but I don't think we have enough for that yet.
HPCToolkit
HPCToolkit is focused on overall application profiling and doesn't have info on GPU kernels (because AMD won't give them an API)
I ended up in contact with Wileam Phan, one of the HPCToolkit devs, via US-RSE, and so I tried out running some profiling on Cholla. Here are the steps to do that.
Load the proper modules. Note that you should load the latest version of HPCToolkit, even if it is the dev version, since (at the time of writing) it has some fixes for AMD that aren't in the main version yet.
`module load ums ums023`, then use `module avail hpctoolkit` and load the latest version.
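For example, the full sequence is something like this (the version string below is hypothetical; pick whichever is newest in the `module avail` output):
module load ums ums023
module avail hpctoolkit
module load hpctoolkit/dev   # hypothetical module name, load the newest version listed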
Compile Cholla. Make sure the `-g` flag is set. It isn't currently set by default for the optimized builds, so go edit `make.host.frontier` and add it to the flags.
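As a rough sketch of what that edit looks like (the variable names below are an assumption on my part; add `-g` to whatever compiler and GPU flag variables the file actually defines):
# in make.host.frontier -- illustrative variable names
CXXFLAGS_OPTIMIZE += -g
GPUFLAGS += -g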
Gather the profiling info by running Cholla under `hpcrun`. This will make a measurement directory named something like `hpctoolkit-cholla.mhd.frontier-measurements`.
`hpcrun -e gpu=amd -t ./cholla.mhd.frontier <parameter_file>`. To run with MPI, just add the launcher stuff before `hpcrun`.
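For instance, a multi-rank run under Slurm would look something like this (the srun options are just placeholders for whatever your job normally uses):
srun -N 1 -n 8 --gpus-per-node=8 hpcrun -e gpu=amd -t ./cholla.mhd.frontier <parameter_file>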
Set up a cache for `hpcstruct`. At this point we have a lot of raw data in a non-human-readable format, so it's time to process it with some of the included tools. Setting up a cache for `hpcstruct` can improve execution time, which is good because it takes a while.
mkdir hpc_cache; export HPCTOOLKIT_HPCSTRUCT_CACHE=hpc_cache
Run HPCStruct. This uses half the available threads by default; you can change that with the `-j` flag. This takes a while, so for large jobs you might consider running this as its own job rather than on a login node. It took about 12 minutes for a single-rank run of Cholla using up to 64 threads.
hpcstruct <measurement_dir>
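If you do run it as its own job, a minimal batch script would look something like the sketch below (the account, time limit, and module version are placeholders, not values from an actual run):
#!/bin/bash
#SBATCH -A <your_project>    # placeholder allocation
#SBATCH -N 1
#SBATCH -t 00:30:00
module load ums ums023
module load hpctoolkit       # load the same (latest) version as before
hpcstruct -j 64 <measurement_dir>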
Run HPCProf. This takes in the data from the previous steps and makes something we can examine with HPCViewer. This might also need to be run in its own job, or in the same job as `hpcstruct`. See the man page for details, as you need a different version of the command for that.
hpcprof <measurement_dir> -o output_dir -j <num_threads>
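For the in-job case, HPCToolkit also ships an MPI-enabled `hpcprof-mpi`; I'd guess the invocation looks roughly like this, but check the man page for the exact arguments:
srun -N 1 -n 2 hpcprof-mpi <measurement_dir> -o output_dir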
Download the output directory and open it using HPCViewer on your laptop/workstation. Install instructions are here. You do need Java installed; they recommend OpenJDK or Adoptium. I installed OpenJDK easily with Homebrew.
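On macOS with Homebrew that's just:
brew install openjdk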
Omniperf
Omniperf is good for single-GPU and kernel-level profiling.
$ module load omniperf
# This takes a while to run since it runs Cholla many times.
# Pick a problem with the expected number of cells/GPU but maybe limit it to ~200 time steps.
# Running the standard Brio & Wu Shock Tube at 256^3 took 18 minutes (each test is 347 time steps).
$ omniperf profile -n cholla_prof --device 0 -- ./cholla.mhd.frontier <parameter_file>
# You can generate only the roofline plot by adding the `--roof-only --kernel-names` arguments
# Text analysis with
$ omniperf analyze -p workloads/cholla_prof/mi200/
# GUI analysis coming when I figure it out
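For reference, the roofline-only invocation would look something like this (the workload name is arbitrary):
$ omniperf profile -n cholla_roof --device 0 --roof-only --kernel-names -- ./cholla.mhd.frontier <parameter_file>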