Feature/quda interface optimize #15

Merged
merged 25 commits into from
Oct 4, 2017


maddyscientist

This pull request considerably optimizes the MILC-QUDA interface for RHMC runs with HISQ fermions, and also includes ancillary cleanup and minor fixes:

  • Offload of the reunitarization calculation to QUDA
  • Optimization of the gauge / momentum extraction from, and packing into, the lattice struct, with this logic abstracted into helper functions in generic_quda.h
  • Use pinned memory for auxiliary and fermion link-field allocations, which improves PCIe copy efficiency
  • Do not compute the "double-store dslash" backward fat and long links when using QUDA. Creating these backward links is a large, unnecessary overhead, accounting for up to 75% of the total fat/long link computation (hence, do not use the double-store dslash when compiling with QUDA)
  • Offload regular dslash calls to QUDA - even with PCIe transfer overhead, this provides a considerable speedup over the standard CPU dslash
  • Introduce get_fn_old and set_fn_old functions so that CG, multi-shift CG and dslash offload can use the same invalidation flag (reduces number of gauge field downloads)
  • Fix flop counter for fat-link computation when GPU offload is enabled
  • Some ancillary code cleanup regarding the QUDA interface

These optimizations considerably reduce the time spent on data management between QUDA and MILC, as well as the time spent in unaccelerated CPU code. As an example, here is a quick RHMD benchmark on my workstation, run from a cold start with few CG iterations:

  • Lattice size 16x16x16x32
  • Double precision
  • 10 trajectories, each of 10 MD steps
  • 6-core Haswell, with 2xP100 GPUs
  • 2 processes, each with 3 OMP threads

| Time Metric | Old (s) | New (s) |
|---|---|---|
| Total time | 367 | 238 |
| Total QUDA | 177 | 176 |
| Fat-link time (QUDA) | 11.1 | 8.2 |
| Fat-link time (MILC) | 27.2 | 10.0 |
| CG time (QUDA) | 38 | 37 |
| CG time (MILC) | 39 | 37 |
| Fermion force (QUDA) | 109 | 108 |
| Fermion force (MILC) | 112 | 111 |
| Gauge force (QUDA) | 10.1 | 8.3 |
| Gauge force (MILC) | 15.1 | 14.2 |
| Gauge update (QUDA) | 7.8 | 7.8 |
| Gauge update (MILC) | 83.1 | 22.8 |

We can see that the main improvement is the reduced gap between the reported QUDA and MILC times. The difference that remains is due to the cache-unfriendly lattice structure in MILC, from which we have to extract, and into which we have to insert, the gauge and momentum fields. This overhead is typically 3x greater than the PCIe transfer overhead.

This pull request should be merged after #14, since it is based on that branch. It also requires that the QUDA changes in lattice/quda#646 (feature/gauge-comms-cleanup) be merged.

maddyscientist and others added 25 commits August 13, 2015 16:10
QUDA 0.9 needs "-lcuda" as well as "-lcudart".
Updated Makefile for QUDA 0.9
Merge in latest MILC master
Merge in latest MILC develop branch
…d for extracting and inserting to/from the site array
…llback to CPU if Schroedinger functional is set
… and remove fermion_links_hisq_load_gpu.c. Fix flop counter when USE_FL_GPU is enabled.
…sary overhead in the fermion-link construction
…e same fn_last cache to avoid unnecessary invalidation when switching between CG and multi-shift CG solvers
@detar detar merged commit f407b32 into milc-qcd:develop Oct 4, 2017

detar commented Oct 4, 2017 via email

detar pushed a commit that referenced this pull request Apr 18, 2019