Dorado 0.8.2 crashes when calling methylation #1098
Comments
Crashing with 1 GPU and the SUP model without modifications, despite a reduced batch size, but with all files from a run:
Crashing with 1 GPU and 1 pod5 file with modifications:
Hi,

From looking at the timings of the exceptions that were caught, these seem to happen ~20-30 mins after startup. Does this reliably happen after ~20 mins? You can

Some of the crashes in #1070 seem very similar to pytorch/pytorch#74235, so can you try some of the steps provided there to confirm that it's not a system issue? Notably, check that your BIOS is up to date.

Thanks,
Hi Ben,

It does not appear to be the case with the 20 min intervals. I tried with two different datasets and those finished fine without issues.

Rasmus
Failed with launchblocking=1 as well on the original dataset.
Hmm, this issue is weird. Could it be something within dorado that is poor at handling temporary issues when starting up? I ran the same sbatch script with the same data on our 2xA10 node twice. The first time it crashed and the second time it completed perfectly fine. I have managed to run the same script with different datasets (barcodes from the same run) a few times; occasionally I get an error, but often it runs smoothly when rerunning.
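Since rerunning the same script often succeeds, a small retry wrapper can serve as a stopgap while the root cause is unknown. The following is only a minimal sketch under that assumption: `run_with_retries`, its parameters, and the `basecall.sh` script name in the usage comment are hypothetical and not part of dorado. It does not address the underlying crash.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def run_with_retries(
    cmd: Sequence[str],
    max_attempts: int = 3,
    delay_s: float = 30.0,
    runner: Optional[Callable[[Sequence[str]], int]] = None,
) -> int:
    """Run `cmd` until it exits with code 0 or `max_attempts` is reached.

    Returns the number of attempts that were needed.
    Raises RuntimeError if every attempt fails.
    `runner` can be swapped out (for example in tests); by default the
    command is executed with subprocess.run and its exit code is used.
    """
    if runner is None:
        runner = lambda c: subprocess.run(list(c)).returncode

    for attempt in range(1, max_attempts + 1):
        code = runner(cmd)
        if code == 0:
            return attempt
        print(f"attempt {attempt}/{max_attempts} failed with exit code {code}")
        if attempt < max_attempts:
            # Pause before retrying in case the failure is transient.
            time.sleep(delay_s)

    raise RuntimeError(f"command failed after {max_attempts} attempts: {list(cmd)}")


# Example usage (hypothetical script name; substitute your own basecalling job):
#   run_with_retries(["bash", "basecall.sh"], max_attempts=3, delay_s=60)
```

Note that a blind rerun restarts the job from scratch, so this only makes sense for runs where a full restart is affordable.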
@Kirk3gaard Did you find a solution for this problem, seeing that you closed the issue?
Issue Report
Please describe the issue:
Dorado 0.8.2 (also) crashes when basecalling with modifications; error message below (#1069).
Steps to reproduce the issue:
Run environment:
Logs