Cluster2TP assoc on GPU #105
Conversation
ok, to get a list of hits that can be compared one can use
indeed zmumu does not reproduce
etc. oops, no: it is the clus2TP that does not fully reproduce in the case of multiple TPs...
Spotted a few things that could be cleaned up, otherwise looks good to me.
count = step;
}
return first;
}
This is the same as in
cmssw/HeterogeneousCore/CUDAUtilities/interface/cudastdAlgorithm.h
Lines 24 to 26 in 64e6201
template<typename RandomIt, typename T, typename Compare = less<T>>
constexpr
RandomIt lower_bound(RandomIt first, RandomIt last, const T& value, Compare comp={})
right?
yes, but the cudastd one does not compile; it seems to require __device__ __host__, at least in this context
Ok. Should we then consider decorating the cudastd ones with __device__ __host__? (possibly in a later PR)
definitely!
I'd prefer that we first find a location for a macro, or something, that guarantees __device__ __host__ is not defined when a non-CUDA compiler is used...
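For illustration, a minimal sketch of such a guard (the header and macro names here are hypothetical, not the PR's actual code):

#ifndef HeterogeneousCore_CUDAUtilities_cudaDecorations_h
#define HeterogeneousCore_CUDAUtilities_cudaDecorations_h

// Expand to the CUDA attributes only when a CUDA compiler drives the build;
// a plain host compiler sees empty decorations, i.e. ordinary C++.
#ifdef __CUDACC__
#define CUDA_HOST_DEVICE __host__ __device__
#else
#define CUDA_HOST_DEVICE
#endif

#endif  // HeterogeneousCore_CUDAUtilities_cudaDecorations_h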
actually this is the error message
/home/vin/GPUDoublets/CMSSW_10_2_0_pre6_Patatrack/src/HeterogeneousCore/CUDAUtilities/interface/cudastdAlgorithm.h(46): error: calling a __device__ function("operator()") from a __host__ __device__ function("lower_bound") is not allowed
detected during instantiation of "RandomIt cuda_std::lower_bound(RandomIt, RandomIt, const T &, Compare) [with RandomIt=const std::array<uint32_t, 4UL> *, T=std::array<uint32_t, 4UL>, Compare=lambda [](const std::array<uint32_t, 4UL> &, const std::array<uint32_t, 4UL> &)->bool]"
/home/vin/GPUDoublets/CMSSW_10_2_0_pre6_Patatrack/src/SimTracker/TrackerHitAssociation/plugins/ClusterSLOnGPU.cu(75): here
pretty bizarre
Maybe the lambda gets declared only as __device__?
Found this: https://devblogs.nvidia.com/new-compiler-features-cuda-8/
What would happen with auto less = [] __host__ __device__ (...)->bool{ ?
yes, in a __global__ function I can mark it __device__ __host__
- auto less = [](std::array<uint32_t,4> const & a, std::array<uint32_t,4> const & b)->bool {
+ auto less = [] __device__ __host__ (std::array<uint32_t,4> const & a, std::array<uint32_t,4> const & b)->bool {
ok fine, it compiles.
I will make a new PR; you can judge how ugly it is...
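For reference, a self-contained sketch of the resulting pattern (all names are illustrative; needs nvcc with --expt-extended-lambda, plus --expt-relaxed-constexpr so std::array can be used in device code):

#include <array>
#include <cstdint>
#include <cstdio>

// simplified binary search in the spirit of cuda_std::lower_bound
template <typename RandomIt, typename T, typename Compare>
__host__ __device__ RandomIt lowerBound(RandomIt first, RandomIt last, const T& value, Compare comp) {
  auto count = last - first;
  while (count > 0) {
    auto step = count / 2;
    auto it = first + step;
    if (comp(*it, value)) {
      first = it + 1;
      count -= step + 1;
    } else {
      count = step;
    }
  }
  return first;
}

__global__ void findLink(const std::array<uint32_t, 4>* links, int n, uint32_t id, uint32_t ch) {
  const std::array<uint32_t, 4> me{{id, ch, 0, 0}};
  // the __device__ __host__ annotation makes the closure callable from device code
  auto less = [] __device__ __host__ (std::array<uint32_t, 4> const& a,
                                      std::array<uint32_t, 4> const& b) -> bool {
    return a[0] < b[0] || (!(b[0] < a[0]) && a[1] < b[1]);
  };
  auto p = lowerBound(links, links + n, me, less);
  printf("lower_bound at position %d\n", int(p - links));
}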
const std::array<uint32_t,4> me{{id,ch,0,0}};

auto less = [](std::array<uint32_t,4> const & a, std::array<uint32_t,4> const & b)->bool {
return a[0]<b[0] || ( !(b[0]<a[0]) && a[1]<b[1]); // in this context we do not care of [2]
I'm not sure I understand the logic. !(b[0]<a[0]) is equivalent to a[0]<=b[0], which, given the left side of ||, has the same effect as a[0]==b[0]. I find the latter easier to understand.
yes, this is the standard way to code lexicographic ordering in std, when the only requirement is the existence of operator< (not operator==)
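Spelled out, the idiom looks like this (an illustrative helper, not code from the PR): only operator< is used, and "equivalence" of the first keys is inferred from neither being smaller than the other.

// lexicographic "less" over two keys, requiring only operator< on T
template <typename T>
bool lexLess(T const& a0, T const& a1, T const& b0, T const& b1) {
  if (a0 < b0) return true;   // first key decides
  if (b0 < a0) return false;  // first key decides the other way
  return a1 < b1;             // first keys equivalent: compare the second
}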
Good point, thanks. On the other hand in this case the compared types are uint32_t, but ok.
cudaCheck(cudaMalloc((void**) & slgpu.me_d, sizeof(ClusterSLGPU)));
cudaCheck(cudaMemcpyAsync(slgpu.me_d, &slgpu, sizeof(ClusterSLGPU), cudaMemcpyDefault, stream.id()));
cudaCheck(cudaDeviceSynchronize());
IIUC this synchronization is not needed.
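A sketch of the reasoning, reusing the names above: work enqueued on one CUDA stream executes in order, so a later kernel launch on the same stream already sees the copied data.

cudaCheck(cudaMalloc((void**) & slgpu.me_d, sizeof(ClusterSLGPU)));
cudaCheck(cudaMemcpyAsync(slgpu.me_d, &slgpu, sizeof(ClusterSLGPU), cudaMemcpyDefault, stream.id()));
// no cudaDeviceSynchronize() needed: the next launch on stream.id() is
// ordered after the copy by stream semantics
simLink<<<blocks, threadsPerBlock, 0, stream.id()>>>(dd.me_d, ndigis, hh.gpu_d, sl.me_d, n);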
cudaCheck(cudaMalloc((void**) & slgpu.n2_d,(ClusterSLGPU::MaxNumModules*256)*sizeof(uint32_t)));
...
cudaCheck(cudaMalloc((void**) & slgpu.me_d, sizeof(ClusterSLGPU)));
Are these freed anywhere?
oopsss, no.
done
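Presumably the fix releases what was allocated above, along these lines (a hedged sketch; the actual change in the PR may differ):

// e.g. in the owning module's destructor / cleanup path
cudaCheck(cudaFree(slgpu.n2_d));
cudaCheck(cudaFree(slgpu.me_d));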
assert(sl.me_d);
simLink<<<blocks, threadsPerBlock, 0, stream.id()>>>(dd.me_d,ndigis, hh.gpu_d, sl.me_d,n);
cudaStreamSynchronize(stream.id());
IIUC this synchronization is not needed (even for the dump below).
it is needed for the dump below, in case of other printfs (it can go inside the if)
Why? dumpLink below is launched asynchronously on the same CUDA stream, so I'd expect it to work without this synchronization.
it is printf that requires synchronization to dump the buffer to the host; otherwise it will overwrite the circular one on the device.
At least this is what I understood (and observed).
Ok, so you want to protect against any potential earlier printf? Then yes, please move to inside the if (with a comment explaining the need).
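The suggested shape of the change, as a sketch (the doDump flag and the dumpLink arguments are placeholders):

simLink<<<blocks, threadsPerBlock, 0, stream.id()>>>(dd.me_d, ndigis, hh.gpu_d, sl.me_d, n);
if (doDump) {
  // flush the device-side printf ring buffer before dumpLink fills it again,
  // so that earlier output is not overwritten
  cudaStreamSynchronize(stream.id());
  dumpLink<<<blocks, threadsPerBlock, 0, stream.id()>>>(/* ... */);
}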
iEvent.put<Output>(std::move(output), [legacy](const GPUProduct& hits, CPUProduct& cpu) {
cpu = *legacy; delete legacy;
});
Nice example in favor of #100.
INDEED
+1 from me
some level of irreproducibility exists:
details change in a few cases: some can be attributed to cluster2TP (to be investigated),
others seem to really come from the clusterizer (always for clusters not associated to any TP???)
Validation summary
Reference release CMSSW_10_2_0_pre6 at a674e1f
@@ -1,6 +1,15 @@
#ifndef HeterogeneousCore_CUDAUtilities_cudastdAlgorithm_h
#define HeterogeneousCore_CUDAUtilities_cudastdAlgorithm_h

#ifdef __CUDACC__
Yes, #include <cuda_runtime.h> is enough, as it #defines away the CUDA-specific attributes when not building for CUDA (i.e. if __CUDACC__ is not defined).
The downside is that one must <use name="cuda"/> in the BuildFile, to let the compiler find cuda_runtime.h in the first place.
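Concretely, the pattern under discussion would look like this in a header (the function itself is a made-up example), with <use name="cuda"/> added to the client's BuildFile:

#include <cstdint>
#include <cuda_runtime.h>  // defines __host__/__device__ to empty when not compiling CUDA

// compiles with both nvcc and a plain host compiler
__host__ __device__ inline uint32_t mix(uint32_t a, uint32_t b) { return a * 0x9e3779b9u + b; }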
so what is the decision?
strictly speaking, clients of HeterogeneousCore/CUDAUtilities should use it (in tests as well)
I've tried to change the #defines, but I ran into link errors: my guess is that some code sees __host__ __device__ as __attribute__((host)) __attribute__((device)), while other code sees an empty #define, and the two symbols do not match...
I think the best solution would be either to #include <cuda_runtime.h>, or to patch the CUDA API wrappers to include that one instead of some internal CUDA includes.
I moved to #include <cuda_runtime.h> and fixed the other __NVCC__ checks.
#ifndef SimTrackerTrackerHitAssociationClusterHeterogeneousProduct_H
#define SimTrackerTrackerHitAssociationClusterHeterogeneousProduct_H

#ifndef __NVCC__
please check for __CUDACC__ rather than __NVCC__
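For context: __CUDACC__ is defined whenever a CUDA compiler (nvcc, or clang in CUDA mode) is compiling CUDA code, while __NVCC__ is specific to nvcc driving the build, so the portable guard is:

#ifdef __CUDACC__
// declarations visible when compiling as CUDA code
#else
// host-compiler-only declarations
#endif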
namespace trackerHitAssociationHeterogeneousProduct {

#ifndef __NVCC__
ditto
ClusterSLGPU * gpu_d=nullptr;
};

#ifndef __NVCC__
ditto
Fixed in #111.
Implement a heterogeneous Cluster-to-TrackingParticle associator running on the GPU.
This PR contains mostly a cluster-to-TrackingParticle association on GPU.
It includes the possibility to dump all hits on the CPU with the corresponding TP, by adding to the config
process.tpClusterProducerHeterogeneous.dumpCSV = True
(of course
process.tpClusterProducerHeterogeneousPixelTrackingOnly.dumpCSV = True
for our workflows).
It also includes a "proto" doublet code, ready to produce Cells to be consumed by the CA.
I prefer this is merged now; we will proceed to create Cells and use them in the CA later.