-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SDL::modulesBuffer as an ES product #16
SDL::modulesBuffer as an ES product #16
Conversation
Note to self: needs to be rebased |
…r in the device space with Alpaka
9d57ff7
to
cbfc1aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure whether this push included any additional development and whether things work now. From a first look, the logic makes sense and I mostly have comments to verify that I understand things.
If things still do not work, could you remind me what you see at the output, i.e. which part of the buffers doesn't get properly propagated? In this case, debugging will be needed as I see no smoking gun issue.
|
||
namespace cms::alpakatools { | ||
template <> | ||
struct CopyToDevice<SDL::modulesBuffer<alpaka_common::DevHost>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so that I can understand better, what is the alpaka_common::DevHost
type used here? A type to be used for alpaka buffers by both host and device?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in alpaka_common
it comes as using DevHost = alpaka::DevCpu
} | ||
|
||
std::optional<SDL::modulesBuffer<alpaka_common::DevHost>> LSTModulesDevESProducer::produce(const TrackerRecoGeometryRecord &iRecord) { | ||
// write directly to SDL : FIXME : SHOULD NOT HAPPEN HERE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this comment refer to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it refers to the lines below L47-48 write to the SDL globals; this should eventually become local
SDL::modulesBuffer<alpaka_common::DevHost> modules(cms::alpakatools::host()); | ||
alpaka::QueueCpuBlocking queue(cms::alpakatools::host()); | ||
SDL::LST::loadMaps(); | ||
SDL::loadModulesFromFile(&modules, SDL::nModules, SDL::nLowerModules, *SDL::pixelMapping, queue, txtFile_.c_str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason to treat modules
and pixelMapping
separately, i.e. one is constructed here, while the other in the SDL::Event
constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't get there yet
now rebased |
after rebase I have an error
which I don't know yet if it's related to my ES work or some mess up I have in the configuration |
@@ -0,0 +1,3 @@ | |||
#include "RecoTracker/LST/interface/LSTModulesDev.h" | |||
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h" | |||
TYPELOOKUP_ALPAKA_TEMPLATED_DATA_REG(SDL::modulesBuffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @makortel
Is there a way to avoid having an explicit host-specific RecoTracker/LST/src/ES_ModulesDev.cc ?
For some reason this TEMPLATED_DATA_REG
leads to appropriate symbols in libRecoTrackerLSTCudaAsync.so
, but nothing in the libRecoTrackerLSTSerialSync.so
This behavior doesn't change if I remove RecoTracker/LST/src/ES_ModulesDev.cc
.
Can LTO be at play in some way?
I tried to add <flags CXXFLAGS="-fno-lto"/>
, but it didn't make the CPU ES edm::typelookup::className
symbols show up in the CPU/serial so file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to avoid having an explicit host-specific RecoTracker/LST/src/ES_ModulesDev.cc ?
No, the behavior is intentional (at least for now). We want the non-Alpaka modules to be able to consume the host-side EventSetup data products (analogously to host-side Event data products). From there it follows that the symbols of the host-side data products must be in the non-Alpaka library of RecoTracker/LST
, and the .cc
file containing the TYPELOOKUP_DATA_REG()
for the host-specific data product must reside there.
I'm trying to debug a setup with dual external build (
the commit range that introduces the dual cpu/cuda build and also the ES details are in a0caa0f...cbfc1aa |
@slava77 Does the crash happen when running on CPU, or when running on GPU? |
I also assume that Is |
it's a default alpaka for LST modules, so I guess it's GPU. |
<tool name="lst_headers" version="1.0">
<client>
<environment name="LSTBASE" default="$CMSSW_BASE/external/lst/TrackLooper"/>
<environment name="INCLUDE" default="$LSTBASE"/>
</client>
<runtime name="LST_BASE" value="$LSTBASE"/>
</tool> |
I didn't check. The alpaka kernels will be different symbols, but yes, there can be the same symbols for non-dev-templated code Recall that the code without my ES-related commits (up to 72edb80 in the diff runs |
with debug symbols added I found that the crash is in m_record = {type_ = {
<edm::TypeIDBase> = {t_ = 0x7fff1edb2fa0},
m_name = 0x8e44 <error: Cannot access memory at address 0x8e44>}},
m_key = {type_ = {<edm::TypeIDBase> = {
t_ = 0x7fff94c048a8 <typeinfo for TrackerGeometry>},
m_name = 0x7fff94c00066 "TrackerGeometry"} So, oddly enough the ES record detail gets corrupted and it's in a non-LST module. |
Thanks @slava77. Going with the hypothesis the cause would be an ODR violation, I'd suggest the following in order to move forward (it's a little bit nasty, bit could be technically ok for now, given that the Replace #include <SDL/Module.h>
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h"
// Temporary hack: The DevHost instantiation is needed in both CPU and GPU plugins,
// whereas the (non-host-)Device instantiation only in the GPU plugin
TYPELOOKUP_DATA_REG(SDL::modulesBuffer<alpaka_common::DevHost>);
TYPELOOKUP_ALPAKA_TEMPLATED_DATA_REG(SDL::modulesBuffer); Move the specialization of Remove Then a higher level question: are you using ESProducer only for the purpose to copy the data to all devices, or is it foreseen for the ES product to be used (shared) by many EDModules? |
LSTModulesDevESProducer::LSTModulesDevESProducer(const edm::ParameterSet &iConfig) | ||
: ESProducer(iConfig), txtFile_{iConfig.getParameter<edm::FileInPath>("txt").fullPath()} | ||
{ | ||
setWhatProduced(this, iConfig.getParameter<std::string>("ComponentName")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ComponentName
is strictly speaking redundant, because the framework already provides the same functionality out of the box with appendToDataLabel
parameter.
SDL::modulesBuffer<alpaka_common::DevHost> modules(cms::alpakatools::host()); | ||
alpaka::QueueCpuBlocking queue(cms::alpakatools::host()); | ||
SDL::LST::loadMaps(); | ||
SDL::loadModulesFromFile(&modules, SDL::nModules, SDL::nLowerModules, *SDL::pixelMapping, queue, txtFile_.c_str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(possibly only just curious) For what is the CPU queue
used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The queue is used in e.g. alpaka::memcpy(queue, modulesBuf->nModules_buf, src_view_nModules);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guessing from the variable names, is this a copy from non-Alpaka-buffer-CPU-memory (src_view_nModules
) to a Alpaka-buffer-CPU-memory (modulesBuf->nModules_buf
)? If this is the case, could the copy be done with std::memcpy()
(or with an assignment if the thing being copied is one number)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general we have struct modulesBuffer<TAcc>* modulesBuf
and the method is used to fill the GPU modules as well (in the standalone version).
In this specific case in the ESProducer, things are explicitly on host/CPU.
So, I'm not sure how to interpret your suggestion to use std::memcpy
in this combined context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I understand the code is more general towards the target device of that memcpy()
, and the host target is just a special case. Thanks for the clarification. Would a comment be worth it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a comment be worth it?
sure, I can add
Don't we have a more substantive problem though:
libsdl_cpu.so and libsdl_cuda.so contain overlapping sets of symbols. |
Good point, already this setup should be a problem. For time being (in order to make progress, but to be undoed in the future) it would be ok to move the
I forgot to ask earlier if these "same symbols" are
I have been assuming the latter (and now I can't remember if it is just my assumption, or if we discussed about this detail in the hackathon).
No, library dependencies via plugins do not get any "extra safety". The library setup is somewhat different in that case though (like the loading order), so if the problem is really in ODR violation (which is undefined behavior), I could believe one setup to look like it works, and the other to fail in strange way. |
both kind; there are quite a few helper methods that are independent of the accelerator type |
@makortel |
this does not compile:
|
I don't really know, but I'd guess the full ASAN build would be safer. I wonder if we already use |
Mmkay, the error looks strange, as the |
https://github.com/cms-sw/cmssw-config/blob/7061c60a2606087a9f8ff456d3afc4bbc78a0fca/Projects/CMSSW/Self.xml#L57 |
sorry, I stopped on the first part of the error message:
there is also a GPU variant of the same set (not included above) |
Thanks, although the rest didn't really help. I tested the following #include "HeterogeneousCore/AlpakaCore/interface/alpaka/typelookup.h"
template <typename Dev>
struct FooTest {
int foo;
};
TYPELOOKUP_DATA_REG(FooTest<alpaka_common::DevHost>);
TYPELOOKUP_ALPAKA_TEMPLATED_DATA_REG(FooTest); as a |
I was able to reproduce the compilation error by adding I'd suggest to enclose those type aliases in SDL into a namespace. Or open an issue in CMSSW GitHub that the setup in |
I moved to |
I updated the PR description now that this seems to run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This side of the ES producer developments looks ok to me, so I am approving. I will push the button to merge at the same time that the Tracklooper part is also ready.
This is coupled with SegmentLinking/TrackLooper#355 where
SDL::modulesBuffer
for its const ES-like part was supplanted bySDL::modulesBufferES
, which is not owned by SDL Event.This allows to create an instance of a
modulesBuffer
in CMSSW side as an ES product which can be consumed directly in the SDL code.