Feature/quda interface optimize #15

maddyscientist · 2017-10-04T00:03:34Z

This pull request represents a considerable optimization of the MILC-QUDA interface when running RHMC HISQ fermions, as well as ancillary cleanup and minor fixes:

* Offload the reunitarization calculation to QUDA
* Optimize the extraction of the gauge and momentum fields from the lattice struct and their packing back into it, and abstract this into helper functions in generic_quda.h
* Use pinned memory for the auxiliary and fermion link-field allocations, which improves PCIe copy efficiency
* Do not compute the "double-store dslash" backward fat and long links when using QUDA. Creating the backward fat and long links is unnecessary there and accounts for up to 75% of the total fat/long link computation, so the double-store dslash is not used when compiling with QUDA
* Offload regular dslash calls to QUDA; even with the PCIe transfer overhead, this provides a considerable speedup over the standard CPU dslash
* Introduce get_fn_old and set_fn_old functions so that CG, multi-shift CG and dslash offload can use the same invalidation flag (this reduces the number of gauge field downloads)
* Fix the flop counter for the fat-link computation when GPU offload is enabled
* Some ancillary code cleanup of the QUDA interface
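The shared invalidation flag mentioned in the list can be illustrated with a minimal sketch. This is not the code in the pull request: the type `links_sketch`, the counter `upload_count`, and every function name below are hypothetical stand-ins, and the real signatures of get_fn_old and set_fn_old may differ. The idea shown is only that all offloaded entry points consult one record of the last link set sent to the GPU, so switching between solvers does not trigger a fresh transfer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a fat/long link set; not MILC's real type. */
typedef struct {
    double fat[4];
    double lng[4];
} links_sketch;

/* One shared record of which link set was last sent to the GPU. */
static const links_sketch *last_uploaded = NULL;
static int upload_count = 0;

/* Hypothetical accessors playing the role of get_fn_old / set_fn_old. */
static const links_sketch *sketch_get_fn_old(void) { return last_uploaded; }
static void sketch_set_fn_old(const links_sketch *fn) { last_uploaded = fn; }

/* Transfer the links only if the shared record says the GPU copy is stale. */
static void ensure_links_on_gpu(const links_sketch *fn)
{
    assert(fn != NULL);
    if (sketch_get_fn_old() != fn) {
        upload_count++;            /* stands in for the real gauge-field transfer */
        sketch_set_fn_old(fn);
    }
}

/* Call whenever the links change on the host, e.g. after a gauge update. */
static void invalidate_links(void) { sketch_set_fn_old(NULL); }

/* Three entry points that all consult the same record. */
static void sketch_cg(const links_sketch *fn)         { ensure_links_on_gpu(fn); }
static void sketch_multishift(const links_sketch *fn) { ensure_links_on_gpu(fn); }
static void sketch_dslash(const links_sketch *fn)     { ensure_links_on_gpu(fn); }
```

Calling `sketch_cg`, `sketch_multishift` and `sketch_dslash` in sequence on the same link set increments `upload_count` once; with one flag per routine it would be incremented three times.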

The effect of these optimizations is to considerably reduce the time spent in data management between QUDA and MILC, as well as the time spent in unaccelerated CPU code. As an example, I ran a quick RHMD benchmark on my workstation, from a cold start with few CG iterations:

* Lattice size 16x16x16x32
* Double precision
* 10 trajectories, each of 10 MD steps
* 6-core Haswell, with 2xP100 GPUs
* 2 processes, each with 3 OMP threads

| Time Metric | Old (s) | New (s) |
| --- | --- | --- |
| Total time | 367 | 238 |
| Total QUDA | 177 | 176 |
| Fat-link time (QUDA) | 11.1 | 8.2 |
| Fat-link time (MILC) | 27.2 | 10.0 |
| CG time (QUDA) | 38 | 37 |
| CG time (MILC) | 39 | 37 |
| Fermion force (QUDA) | 109 | 108 |
| Fermion force (MILC) | 112 | 111 |
| Gauge force (QUDA) | 10.1 | 8.3 |
| Gauge force (MILC) | 15.1 | 14.2 |
| Gauge update (QUDA) | 7.8 | 7.8 |
| Gauge update (MILC) | 83.1 | 22.8 |
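
To make the table easier to read, the following sketch computes the gap between the MILC-reported and QUDA-reported time for each component, before and after the change. It assumes that this gap is a fair proxy for the interface overhead; the struct and function names are illustrative only, and the numbers are copied from the table above.

```c
#include <stdio.h>

/* One component from the table: times reported by QUDA and by MILC, in seconds. */
typedef struct {
    const char *name;
    double quda_old, milc_old;
    double quda_new, milc_new;
} timing_row;

/* Values copied from the benchmark table. */
static const timing_row rows[] = {
    {"Fat-link",       11.1,  27.2,   8.2,  10.0},
    {"CG",             38.0,  39.0,  37.0,  37.0},
    {"Fermion force", 109.0, 112.0, 108.0, 111.0},
    {"Gauge force",    10.1,  15.1,   8.3,  14.2},
    {"Gauge update",    7.8,  83.1,   7.8,  22.8},
};
enum { NUM_ROWS = sizeof rows / sizeof rows[0] };

/* Gap between what MILC measures and what QUDA measures for one component. */
static double gap_old(const timing_row *r) { return r->milc_old - r->quda_old; }
static double gap_new(const timing_row *r) { return r->milc_new - r->quda_new; }

/* Sum of the per-component gaps; use_new selects the "New" column. */
static double total_gap(int use_new)
{
    double sum = 0.0;
    for (int i = 0; i < NUM_ROWS; i++)
        sum += use_new ? gap_new(&rows[i]) : gap_old(&rows[i]);
    return sum;
}

/* Print one line per component plus the totals. */
static void print_gaps(void)
{
    for (int i = 0; i < NUM_ROWS; i++)
        printf("%-14s old gap %5.1f s   new gap %5.1f s\n",
               rows[i].name, gap_old(&rows[i]), gap_new(&rows[i]));
    printf("%-14s old gap %5.1f s   new gap %5.1f s\n",
           "Sum", total_gap(0), total_gap(1));
}
```

For the gauge update, for instance, the gap falls from 83.1 - 7.8 = 75.3 s to 22.8 - 7.8 = 15.0 s, while the time reported by QUDA itself is unchanged.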

We can see that the main improvement is a smaller gap between the times reported by QUDA and by MILC. The gap that remains is due to the cache-unfriendly lattice structure in MILC, from which we have to extract, and into which we have to insert, the gauge and momentum fields. This overhead is typically 3x greater than the PCIe transfer overhead.
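
The extraction and insertion step described above can be sketched as follows. This is a simplified illustration, not the helpers in generic_quda.h: `su3_sketch`, `site_sketch`, `extract_links` and `insert_links` are hypothetical names, and the layout of the site record is an assumption chosen to show why the access pattern is strided. Each site's links sit next to unrelated per-site data, so gathering them into one contiguous buffer touches memory with a large stride.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 3x3 complex matrix (18 reals); not MILC's actual matrix type. */
typedef struct { double e[3][3][2]; } su3_sketch;

/* Hypothetical site record: the four links are embedded among other
   per-site fields, so links of consecutive sites are far apart in memory. */
typedef struct {
    int x, y, z, t;
    su3_sketch link[4];
    double other_fields[64];   /* stands in for the rest of the site data */
} site_sketch;

/* Gather the links of every site into one contiguous buffer
   (out must hold 4 * nsites matrices), ready for a single host-to-GPU copy. */
static void extract_links(const site_sketch *lattice, size_t nsites,
                          su3_sketch *out)
{
    for (size_t i = 0; i < nsites; i++)
        memcpy(&out[4 * i], lattice[i].link, sizeof lattice[i].link);
}

/* Scatter a contiguous buffer of links back into the site records. */
static void insert_links(site_sketch *lattice, size_t nsites,
                         const su3_sketch *in)
{
    for (size_t i = 0; i < nsites; i++)
        memcpy(lattice[i].link, &in[4 * i], sizeof lattice[i].link);
}
```

In this sketch every iteration jumps by `sizeof(site_sketch)` bytes on the lattice side while the buffer side advances contiguously, which is the kind of access pattern that makes the copy cache-unfriendly.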

This pull request should be merged after #14, since it is based on it. It also requires that the QUDA changes in lattice/quda#646 (feature/gauge-comms-cleanup) be merged.

detar · 2017-10-04T04:12:10Z

Hi Kate, I have approved this request. I'm impressed with all the work you have done! I will want to do some testing now. Best, Carleton

File Changes

* M Make_template_combos (7)
* M Makefile (13)
* M generic/gauge_force_imp_gpu.c (43)
* M generic/gauge_stuff.c (20)
* M generic/io_helpers.c (6)
* M generic/io_lat_utils.c (10)
* M generic/reunitarize2.c (54)
* M generic_ks/Make_template (3)
* M generic_ks/d_congrad5_fn_gpu.c (11)
* M generic_ks/d_congrad5_fn_milc.c (1)
* M generic_ks/dslash_fn.c (51)
* M generic_ks/fermion_force_asqtad_gpu.c (11)
* M generic_ks/fermion_links_fn_load_gpu.c (74)
* M generic_ks/fermion_links_fn_load_milc.c (10)
* M generic_ks/fermion_links_from_site.c (28)
* D generic_ks/fermion_links_hisq_load_gpu.c (55)
* M generic_ks/fermion_links_hisq_load_milc.c (9)
* M generic_ks/fermion_links_hisq_milc.c (3)
* M generic_ks/fn_links_milc.c (4)
* M generic_ks/ks_multicg_offset_gpu.c (21)
* M include/generic_quda.h (124)
* M ks_imp_rhmc/update_u.c (33)

Patch Links:

* https://github.com/milc-qcd/milc_qcd/pull/15.patch
* https://github.com/milc-qcd/milc_qcd/pull/15.diff

maddyscientist and others added 25 commits August 13, 2015 16:10

Merge branch 'feature/no_cpu_refine' into develop

9edc3cd

Updated Makefile for QUDA 0.9

f957855

QUDA 0.9 needs "-lcuda" as well as "-lcudart".

Merge pull request #7 from maddyscientist/patch-1

6798ef4

Updated Makefile for QUDA 0.9

Merge pull request #2 from milc-qcd/master

ae975d9

Merge in latest MILC master

Merge branch 'master' into develop

4a90232

Merge pull request #3 from milc-qcd/develop

a67218e

Merge in latest MILC develop branch

Merge pull request #4 from milc-qcd/develop

ab52202

Develop merge

Merge pull request #5 from milc-qcd/develop

5a17bcc

Gauge force typo

Merge pull request #6 from milc-qcd/develop

f611390

Develop

Enable -Wall (with some opt outs), and fix all generated warnings

7825122

Fix couple of things in last commit

188464f

Add QUDA specific routines for creating gauge and momentum fields, and for extracting and inserting to/from the site array

88e0340

Cleanup of QUDA gauge force and link update routines to use helper functions introduced in 88e0340

d7ed3b4

Offload reunitarization to QUDA if gauge-force offload is enabled. Fallback to CPU if Schroedinger functional is set

6f23a19

Remove legacy use of use_pinned_memory member in QudaFatLinkArgs_t

06aa7b6

Use pinned memory for ASQTAD GPU outer product vectors

a399776

When creating/restoring fermion links on GPU, use QUDA helpers for extracting the gauge-field matrices

814f145

Use pinned memory allocation for the auxiliary and fermion link-field arrays when USE_FL_GPU is enabled

ef7caee

Cleanup: put load_hisq_aux_links_gpu into fermion_links_fn_load_gpu.c and remove fermion_links_hisq_load_gpu.c. Fix flop counter when USE_FL_GPU is enabled.

44ea6d9

Fix bug in gauge force introduced in prior cleanup in d7ed3b4

e05bd59

Minor cleanup of generic_quda.h

2e773a9

Fix warning in dslash_fn.c

34c0500

If using QUDA, do not use double-store dslash since this adds unnecessary overhead in the fermion-link construction

d24fd72

QUDA-accelerated multi-shift and regular CG HISQ solvers now share the same fn_last cache to avoid unnecessary invalidation when switching between CG and multi-shift CG solvers

c50ef50

When compiling with HISQ CG GPU offload, also offload all calls to dslash to QUDA

a7654a0

detar merged commit f407b32 into milc-qcd:develop Oct 4, 2017

detar pushed a commit that referenced this pull request Apr 18, 2019

Merge pull request #15 from milc-qcd/develop

649c19f

Develop
