I've started doing a little bit of performance analysis of Cholla on Frontier. I figure that's something most of us will be doing at some point soon, so I wanted to collect our thoughts, experience, guides, etc. here so that we can all share what we learn and how we learn it. I think eventually this should become a wiki page, but I don't think we have enough for that yet.
HPCToolkit
HPCToolkit is focused on overall application profiling and doesn't have info on GPU kernels (because AMD won't give them an API)
I ended up in contact with Wileam Phan, one of the HPCToolkit devs, via US-RSE, and so I tried out running some profiling on Cholla. Here are the steps to do that.
Load the proper modules. Note that you should load the latest version of HPCToolkit, even if it is the dev version, since (at the time of writing) it has some fixes for AMD that aren't in the main version yet.
`module load ums ums023`, then use `module avail hpctoolkit` and load the latest version.
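For example, the full sequence is something like this (the version string below is hypothetical; pick whichever is newest in the `module avail` output):
module load ums ums023
module avail hpctoolkit
module load hpctoolkit/dev   # hypothetical module name, load the newest version listed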
Compile Cholla. Make sure the `-g` flag is set. It isn't currently set by default for the optimized builds, so go edit `make.host.frontier` and add it to the flags.
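As a rough sketch of what that edit looks like (the variable names below are an assumption on my part; add `-g` to whatever compiler and GPU flag variables the file actually defines):
# in make.host.frontier -- illustrative variable names
CXXFLAGS_OPTIMIZE += -g
GPUFLAGS += -g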
Gather the profiling info by running Cholla under `hpcrun`. This will make a measurement directory named something like `hpctoolkit-cholla.mhd.frontier-measurements`.
`hpcrun -e gpu=amd -t ./cholla.mhd.frontier <parameter_file>`. To run with MPI, just add the launcher stuff before `hpcrun`.
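For instance, a multi-rank run under Slurm would look something like this (the srun options are just placeholders for whatever your job normally uses):
srun -N 1 -n 8 --gpus-per-node=8 hpcrun -e gpu=amd -t ./cholla.mhd.frontier <parameter_file>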
Set up a cache for `hpcstruct`. At this point we have a lot of raw data in a non-human-readable format, so it's time to process it with some of the included tools. Setting up a cache for `hpcstruct` can improve execution time, which is good because it takes a while.
mkdir hpc_cache; export HPCTOOLKIT_HPCSTRUCT_CACHE=hpc_cache
Run HPCStruct. This uses half the available threads by default; you can change that with the `-j` flag. This takes a while, so for large jobs you might consider running this as its own job rather than on a login node. It took about 12 minutes for a single-rank run of Cholla using up to 64 threads.
hpcstruct <measurement_dir>
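If you do run it as its own job, a minimal batch script would look something like the sketch below (the account, time limit, and module version are placeholders, not values from an actual run):
#!/bin/bash
#SBATCH -A <your_project>    # placeholder allocation
#SBATCH -N 1
#SBATCH -t 00:30:00
module load ums ums023
module load hpctoolkit       # load the same (latest) version as before
hpcstruct -j 64 <measurement_dir>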
Run HPCProf. This takes in the data from the previous steps and makes something we can examine with HPCViewer. This might also need to be run in its own job, or in the same job as `hpcstruct`. See the man page for details, as you need a different version of the command for that.
hpcprof <measurement_dir> -o output_dir -j <num_threads>
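For the in-job case, HPCToolkit also ships an MPI-enabled `hpcprof-mpi`; I'd guess the invocation looks roughly like this, but check the man page for the exact arguments:
srun -N 1 -n 2 hpcprof-mpi <measurement_dir> -o output_dir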
Download the output directory and open it using HPCViewer on your laptop/workstation. Install instructions are here. You do need Java installed; they recommend OpenJDK or Adoptium. I installed OpenJDK easily with Homebrew.
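On macOS with Homebrew that's just:
brew install openjdk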
Omniperf
Omniperf is good for single-GPU and kernel-level profiling.
$ module load omniperf
# This takes a while to run since it runs Cholla many times.
# Pick a problem with the expected number of cells/GPU but maybe limit it to ~200 time steps.
# Running the standard Brio & Wu Shock Tube at 256^3 took 18 minutes (each test is 347 time steps).
$ omniperf profile -n cholla_prof --device 0 -- ./cholla.mhd.frontier <parameter_file>
# You can generate only the roofline plot by adding the `--roof-only --kernel-names` arguments
# Text analysis with
$ omniperf analyze -p workloads/cholla_prof/mi200/
# GUI analysis coming when I figure it out
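For reference, the roofline-only invocation would look something like this (the workload name is arbitrary):
$ omniperf profile -n cholla_roof --device 0 --roof-only --kernel-names -- ./cholla.mhd.frontier <parameter_file>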