
BadAlloc exception with GPU WF on non-GPU machine. #33103

Closed
thomreis opened this issue Mar 8, 2021 · 5 comments

thomreis commented Mar 8, 2021

I get a BadAlloc exception when running the GPU workflow 10824.512 on a standard lxplus node. I thought this behaviour had been changed to a more explanatory exception when the CUDAService is disabled, with #32155.

Release is CMSSW_11_3_0_pre3.
Command: runTheMatrix.py -w relval_gpu -l 10824.512

```
%MSG-w CUDAService:  (NoModuleName) 08-Mar-2021 17:08:57 CET pre-events
Failed to initialize the CUDA runtime.
Disabling the CUDAService.
%MSG
08-Mar-2021 17:09:02 CET  Initiating request to open file file:step2.root
08-Mar-2021 17:09:04 CET  Successfully opened file file:step2.root
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 08-Mar-2021 17:09:10.573 CET

std::bad_alloc exception

std::bad_alloc exception

std::bad_alloc exception

std::bad_alloc exception
----- Begin Fatal Exception 08-Mar-2021 17:09:11 CET-----------------------
An exception of category 'BadAlloc' occurred while
   [0] Running EventSetup component EcalElectronicsMappingGPUESProducer/'ecalElectronicsMappingGPUESProducer'
Exception Message:
A std::bad_alloc exception was thrown.
The job has probably exhausted the virtual memory available to the process.
----- End Fatal Exception -------------------------------------------------
Another exception was caught while trying to clean up files after the primary fatal exception.
08-Mar-2021 17:09:11 CET  Closed file file:step2.root
```

cmsbuild commented Mar 8, 2021

A new Issue was created by @thomreis Thomas Reis.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here


makortel commented Mar 8, 2021

assign heterogeneous


cmsbuild commented Mar 8, 2021

New categories assigned: heterogeneous

@makortel,@fwyzard you have been requested to review this Pull request/Issue and eventually sign? Thanks


makortel commented Mar 8, 2021

@thomreis The 10824.512 is a "GPU-only" workflow, right? (i.e. without SwitchProducer)

There are some differences compared to
#31719 (comment):

  • there the symptom was a segfault, here it is an exception
  • there the segfault was caused by an EDModule, here the exception is thrown from an ESProducer

Likely what happens is that some CUDA EDModule declares that it consumes the ESProduct produced by ecalElectronicsMappingGPUESProducer; then, in the prefetching phase of that EDModule, the ESProducer runs, and some CUDA call there leads to the exception (I'd guess HostAllocator, but I didn't check).

As things are now it is hard to do much better (unless we add an explicit check on CUDAService::enabled() to every CUDA ESProducer, which I'd rather not do). In the next revision of the pattern for CUDA ESProducers (I was thinking of doing it as part of #30266, but maybe it could be done separately as well) an exception similar to #32155 would emerge rather naturally.


thomreis commented Mar 8, 2021

Hi @makortel, thanks for the explanation. I should have looked closer, but at first glance it reminded me of the segfault from before. Closing this then.

@thomreis thomreis closed this as completed Mar 8, 2021