
BadAlloc exception with GPU WF on non-GPU machine. #33103

Closed
thomreis opened this issue Mar 8, 2021 · 5 comments

thomreis commented Mar 8, 2021

I get a BadAlloc exception when running the GPU workflow 10824.512 on a standard lxplus node. I thought this behaviour had been changed to a more explanatory exception when the CUDAService is disabled, with #32155.

Release is CMSSW_11_3_0_pre3.
Command: runTheMatrix.py -w relval_gpu -l 10824.512

```
%MSG-w CUDAService:  (NoModuleName) 08-Mar-2021 17:08:57 CET pre-events
Failed to initialize the CUDA runtime.
Disabling the CUDAService.
%MSG
08-Mar-2021 17:09:02 CET  Initiating request to open file file:step2.root
08-Mar-2021 17:09:04 CET  Successfully opened file file:step2.root
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 08-Mar-2021 17:09:10.573 CET

std::bad_alloc exception

std::bad_alloc exception

std::bad_alloc exception

std::bad_alloc exception
----- Begin Fatal Exception 08-Mar-2021 17:09:11 CET-----------------------
An exception of category 'BadAlloc' occurred while
   [0] Running EventSetup component EcalElectronicsMappingGPUESProducer/'ecalElectronicsMappingGPUESProducer'
Exception Message:
A std::bad_alloc exception was thrown.
The job has probably exhausted the virtual memory available to the process.
----- End Fatal Exception -------------------------------------------------
Another exception was caught while trying to clean up files after the primary fatal exception.
08-Mar-2021 17:09:11 CET  Closed file file:step2.root
```

cmsbuild commented Mar 8, 2021

A new Issue was created by @thomreis Thomas Reis.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here


makortel commented Mar 8, 2021

assign heterogeneous


cmsbuild commented Mar 8, 2021

New categories assigned: heterogeneous

@makortel,@fwyzard you have been requested to review this Pull request/Issue and eventually sign? Thanks


makortel commented Mar 8, 2021

@thomreis The 10824.512 is a "GPU-only" workflow, right? (i.e. without SwitchProducer)

There are some differences compared to
#31719 (comment):

  • there the symptom was a segfault, here it is an exception
  • there the segfault was caused by an EDModule, here the exception is thrown from an ESProducer

Likely what happens is that some CUDA EDModule declares that it consumes the ESProduct produced by ecalElectronicsMappingGPUESProducer; then, in the prefetching phase of that EDModule, the ESProducer runs, and some CUDA call there leads to the exception (I'd guess HostAllocator, but I didn't check).

As things are now it is hard to do much better (unless we add an explicit check on CUDAService::enabled() to every CUDA ESProducer, which I'd rather not do). In the next revision of the pattern for CUDA ESProducers (I was thinking of doing it as part of #30266, but maybe it could be done separately as well) an exception similar to #32155 would emerge rather naturally.


thomreis commented Mar 8, 2021

Hi @makortel, thanks for the explanation. I should have looked closer, but at first glance it reminded me of the segfault from before. Closing this then.

@thomreis thomreis closed this as completed Mar 8, 2021