Feature/quda interface optimize #15
Merged: detar merged 25 commits into milc-qcd:develop from lattice:feature/quda-interface-optimize on Oct 4, 2017
QUDA 0.9 needs "-lcuda" as well as "-lcudart".
Updated Makefile for QUDA 0.9
Merge in latest MILC master
Merge in latest MILC develop branch
Develop merge
Gauge force typo
…d for extracting and inserting to/from the site array
…nctions introduced in 88e0340
…llback to CPU if Schroedinger functional is set
…tracting the gauge-field matrices
…arrays when USE_FL_GPU is enabled
… and remove fermion_links_hisq_load_gpu.c. Fix flop counter when USE_FL_GPU is enabled.
…sary overhead in the fermion-link construction
…e same fn_last cache to avoid unnecessary invalidation when switching between CG and multi-shift CG solvers
Hi Kate,
I have approved this request. I'm impressed with all the work you have
done! I will want to do some testing now.
Best,
Carleton
…On 10/3/17 6:03 PM, maddyscientist wrote:
This pull request represents a considerable optimization of the
MILC-QUDA interface when running RHMC HISQ fermions, as well as
ancillary cleanup and minor fixes:
* Offload of the reunitarization calculation to QUDA
* Optimization of the gauge / momentum extraction from and packing
into the lattice struct, and abstraction of this into helper functions
in generic_quda.h
* Use pinned memory for the auxiliary and fermion link-field
allocations, which improves PCIe copy efficiency
* Do not compute the backward fat and long links used by the
"double-store dslash" when using QUDA. Creating them is a large and
unnecessary overhead (up to 75% of the total fat/long link
computation), so the double-store dslash is not used when compiling
with QUDA
* Offload regular dslash calls to QUDA - even with PCIe transfer
overhead, this provides a considerable speedup over the standard
CPU dslash
* Introduce `get_fn_old` and `set_fn_old` functions so that CG,
multi-shift CG and dslash offload can use the same invalidation
flag (reduces the number of gauge field downloads)
* Fix flop counter for fat-link computation when GPU offload is enabled
* Some ancillary code clean up regarding the QUDA interface
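The shared invalidation flag described in the list above can be illustrated with a minimal sketch. Everything here is hypothetical: the `toy_` names, the types, and the signatures are stand-ins chosen for illustration, and the real `get_fn_old` / `set_fn_old` signatures in MILC are not shown in this pull request text. The idea assumed is that every QUDA-offloaded routine consults one shared record of which links were last sent to the device, so switching between solvers does not trigger a fresh transfer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a set of fermion links held on the host. */
typedef struct { int id; } toy_links_t;

/* Shared state: which links the device currently holds (NULL = none or stale). */
static const toy_links_t *toy_fn_old = NULL;
static int toy_upload_count = 0;   /* counts simulated host-to-device transfers */

/* Single pair of accessors used by every offloaded routine. */
const toy_links_t *toy_get_fn_old(void) { return toy_fn_old; }
void toy_set_fn_old(const toy_links_t *fn) { toy_fn_old = fn; }

/* Call whenever the links are modified in place; forces the next user to re-upload. */
void toy_invalidate_links(void) { toy_set_fn_old(NULL); }

/* Transfer only if the device copy does not match the requested links. */
static void toy_ensure_links_on_device(const toy_links_t *fn) {
    assert(fn != NULL);
    if (toy_get_fn_old() != fn) {
        toy_upload_count++;        /* a real interface would copy the links here */
        toy_set_fn_old(fn);
    }
}

/* Three different entry points, one shared flag. */
void toy_cg(const toy_links_t *fn)            { toy_ensure_links_on_device(fn); }
void toy_multishift_cg(const toy_links_t *fn) { toy_ensure_links_on_device(fn); }
void toy_dslash(const toy_links_t *fn)        { toy_ensure_links_on_device(fn); }

int toy_uploads(void) { return toy_upload_count; }
```

In this sketch, calling `toy_cg`, `toy_multishift_cg`, and `toy_dslash` in turn on the same links costs one transfer rather than three. If each routine kept its own private flag, every switch between them would look like a cache miss.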
The effect of these optimizations is to considerably reduce the time
spent in data management between QUDA and MILC, as well as the
time spent in unaccelerated CPU code. As an example, here is
a quick RHMD benchmark on my workstation, run from a cold
start with few CG iterations:
* Lattice size 16x16x16x32
* Double precision
* 10 trajectories, each of 10 MD steps
* 6-core Haswell, with 2xP100 GPUs
* 2 processes, each with 3 OMP threads
| Time metric          | Old (s) | New (s) |
|----------------------|---------|---------|
| Total time           | 367     | 238     |
| Total QUDA           | 177     | 176     |
| Fat-link time (QUDA) | 11.1    | 8.2     |
| Fat-link time (MILC) | 27.2    | 10.0    |
| CG time (QUDA)       | 38      | 37      |
| CG time (MILC)       | 39      | 37      |
| Fermion force (QUDA) | 109     | 108     |
| Fermion force (MILC) | 112     | 111     |
| Gauge force (QUDA)   | 10.1    | 8.3     |
| Gauge force (MILC)   | 15.1    | 14.2    |
| Gauge update (QUDA)  | 7.8     | 7.8     |
| Gauge update (MILC)  | 83.1    | 22.8    |
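The gap between the MILC-reported and QUDA-reported time for each phase can be computed directly from the numbers above. The following is only a bookkeeping sketch: the struct and function names are made up for illustration, and treating "MILC time minus QUDA time" as the per-phase interface gap is an assumption about how the two timers nest.

```c
#include <assert.h>

/* One phase from the benchmark: time inside QUDA and time as seen by MILC. */
typedef struct {
    const char *phase;
    double quda_old, milc_old;   /* seconds, before this pull request */
    double quda_new, milc_new;   /* seconds, after this pull request  */
} phase_timing;

/* Values copied from the benchmark figures quoted above. */
static const phase_timing timings[] = {
    { "Fat-link",       11.1,  27.2,   8.2,  10.0 },
    { "CG",             38.0,  39.0,  37.0,  37.0 },
    { "Fermion force", 109.0, 112.0, 108.0, 111.0 },
    { "Gauge force",    10.1,  15.1,   8.3,  14.2 },
    { "Gauge update",    7.8,  83.1,   7.8,  22.8 },
};
enum { N_PHASES = sizeof timings / sizeof timings[0] };

/* Gap for one phase: MILC-reported time minus QUDA-reported time. */
double gap_old(int i) {
    assert(i >= 0 && i < N_PHASES);
    return timings[i].milc_old - timings[i].quda_old;
}

double gap_new(int i) {
    assert(i >= 0 && i < N_PHASES);
    return timings[i].milc_new - timings[i].quda_new;
}

/* Sum of the gaps over all phases; pass nonzero to use the new timings. */
double total_gap(int use_new) {
    double sum = 0.0;
    for (int i = 0; i < N_PHASES; i++)
        sum += use_new ? gap_new(i) : gap_old(i);
    return sum;
}
```

On these figures the summed gap falls from about 100.4 s to about 25.7 s, with most of the change coming from the gauge update (75.3 s down to 15.0 s) and the fat-link computation (16.1 s down to 1.8 s).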
We can see that the main improvement is a smaller gap between the
reported QUDA and MILC times. The difference that remains is due to
the cache-unfriendly lattice structure in MILC, from which we have to
extract, and into which we have to insert, the gauge and momentum
fields. This overhead is typically 3x greater than the PCIe transfer
overhead.
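To make the extraction and insertion cost concrete, here is a minimal sketch of gathering link matrices out of an array-of-structs lattice into a contiguous buffer and scattering them back. The `toy_site` layout, the 512-byte padding, and the function names are all assumptions for illustration; they are not MILC's actual site struct or the helpers added to generic_quda.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { double re, im; } toy_complex;
typedef struct { toy_complex e[3][3]; } toy_su3_matrix;   /* 144 bytes */

/* Toy site: four links plus padding standing in for unrelated per-site fields. */
typedef struct {
    toy_su3_matrix link[4];
    unsigned char other_fields[512];
} toy_site;

/* Gather: copy the four links of every site into one contiguous array. */
void toy_extract_links(const toy_site *lat, size_t nsites, toy_su3_matrix *packed) {
    assert(lat != NULL && packed != NULL);
    for (size_t i = 0; i < nsites; i++)
        memcpy(&packed[4 * i], lat[i].link, sizeof lat[i].link);
}

/* Scatter: copy the contiguous links back into their sites. */
void toy_insert_links(toy_site *lat, size_t nsites, const toy_su3_matrix *packed) {
    assert(lat != NULL && packed != NULL);
    for (size_t i = 0; i < nsites; i++)
        memcpy(lat[i].link, &packed[4 * i], sizeof lat[i].link);
}
```

Each iteration copies 576 bytes of links and then skips over the rest of the site, so the traversal is strided rather than streaming. That access pattern is one plausible reading of why the extraction can cost more than the transfer of the packed buffer itself.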
This pull request should be merged after #14, since it is based on
it. It also requires that the QUDA changes in lattice/quda#646
(feature/gauge-comms-cleanup) are merged in.
Commit Summary
* Merge branch 'feature/no_cpu_refine' into develop
* Updated Makefile for QUDA 0.9
* Merge pull request #7 from maddyscientist/patch-1
* Merge pull request #2 from milc-qcd/master
* Merge branch 'master' into develop
* Merge pull request #3 from milc-qcd/develop
* Merge pull request #4 from milc-qcd/develop
* Merge pull request #5 from milc-qcd/develop
* Merge pull request #6 from milc-qcd/develop
* Enable -Wall (with some opt outs), and fix all generated warnings
* Fix couple of things in last commit
* Add QUDA specific routines for creating gauge and momentum fields,
and for extracting and inserting to/from the site array
* Cleanup of QUDA gauge force and link update routines to use helper
functions introduced in 88e0340
* Offload reunitarization to QUDA if gauge-force offload is enabled.
Fallback to CPU if Schroedinger functional is set
* Remove legacy use of use_pinned_memory member in QudaFatLinkArgs_t
* Use pinned memory for ASQTAD GPU outer product vectors
* When creating/restoring fermion links on GPU, use QUDA helpers for
extracting the gauge-field matrices
* Use pinned memory allocation for the auxiliary and fermion
link-field arrays when USE_FL_GPU is enabled
* Cleanup: put load_hisq_aux_links_gpu into
fermion_links_fn_load_gpu.c and remove
fermion_links_hisq_load_gpu.c. Fix flop counter when USE_FL_GPU is
enabled.
* Fix bug in gauge force introduced in prior cleanup in
d7ed3b4
* Minor cleanup of generic_quda.h
* Fix warning in dslash_fn.c
* If using QUDA, do not use double-store dslash since this adds
unnecessary overhead in the fermion-link construction
* QUDA-accelerated multi-shift and regular CG HISQ solvers now share
the same fn_last cache to avoid unnecessary invalidation when
switching between CG and multi-shift CG solvers
* When compiling with HISQ CG GPU offload, also offload all calls to
dslash to QUDA
File Changes
* *M* Make_template_combos
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-0> (7)
* *M* Makefile
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-1> (13)
* *M* generic/gauge_force_imp_gpu.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-2> (43)
* *M* generic/gauge_stuff.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-3> (20)
* *M* generic/io_helpers.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-4> (6)
* *M* generic/io_lat_utils.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-5> (10)
* *M* generic/reunitarize2.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-6> (54)
* *M* generic_ks/Make_template
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-7> (3)
* *M* generic_ks/d_congrad5_fn_gpu.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-8> (11)
* *M* generic_ks/d_congrad5_fn_milc.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-9> (1)
* *M* generic_ks/dslash_fn.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-10> (51)
* *M* generic_ks/fermion_force_asqtad_gpu.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-11> (11)
* *M* generic_ks/fermion_links_fn_load_gpu.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-12> (74)
* *M* generic_ks/fermion_links_fn_load_milc.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-13> (10)
* *M* generic_ks/fermion_links_from_site.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-14> (28)
* *D* generic_ks/fermion_links_hisq_load_gpu.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-15> (55)
* *M* generic_ks/fermion_links_hisq_load_milc.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-16> (9)
* *M* generic_ks/fermion_links_hisq_milc.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-17> (3)
* *M* generic_ks/fn_links_milc.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-18> (4)
* *M* generic_ks/ks_multicg_offset_gpu.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-19> (21)
* *M* include/generic_quda.h
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-20> (124)
* *M* ks_imp_rhmc/update_u.c
<https://github.com/milc-qcd/milc_qcd/pull/15/files#diff-21> (33)
Patch Links:
* https://github.com/milc-qcd/milc_qcd/pull/15.patch
* https://github.com/milc-qcd/milc_qcd/pull/15.diff
detar pushed a commit that referenced this pull request on Apr 18, 2019