Passing `bam::record::Record` between threads causes a segfault #293
@DonFreed Do you still get the error if you manually allocate a read and fill it in while iterating over the BAM, then make a copy to pass to the channel?
@sitag Thanks for the response. Manually allocating multiple records and moving them to a second thread produces the same type of error. Manually allocating a single record, passing it back and forth between threads, and eventually dropping it in the original thread seems to work, which is probably an acceptable workaround for my use case. Here's a gist with modifications to the earlier example code that tests both cases: https://gist.github.com/DonFreed/c42725df1e3f81cfbd7e527016d34d6d
@DonFreed I suspect that it has to do with drop; there is a call to …
There's always the option to remove the header reference. I've always found it a bit weird, given the structure of a SAM/BAM/CRAM, that every record should contain a reference to the header. On the other hand, I usually do my parallelization over chromosomes/contigs using the `IndexedReader`. This saves the burden of sending records across channels, but it might not be viable for every genome.
@jch-13 Unless I am misunderstanding something, since …
I think it's a bit painful to see a segfaulting issue go stale, so let me try to revive this. It seems clear that the current situation is invalid. I see three possible solutions, also mentioned before:

1. Drop the `Send`/`Sync` implementations, so records can no longer be passed between threads.
2. Replace the `Rc` header handle with an `Arc`, accepting the atomic refcounting overhead.
3. Remove the header reference from `Record` entirely.

Anyone see another way out of this? Do we take a vote? I'm on 1 or 3, but mostly 3.
@veldsla I agree. As it stands, the …
I looked at the implications of removal. Mostly straightforward, except that it breaks the … The implementation is not really that well suited anyway. I have a branch with the changes, so a PR is trivial. @johanneskoester, @pmarks, any thoughts?
@veldsla I wonder if we could collect more data on 2? My naive expectation is that it would be a negligible performance difference that would be difficult to observe. It might be interesting to do a benchmark before we rule out that option on the grounds of performance, though I do agree this would go against the "slow features should be opt-in" principle. For example, if decrementing the Arc is very fast compared to the rest of the … I'm also OK with 3 if that's what people prefer. 1 is a definite no-go.
@pmarks I did some tests using a simple program that reads every record in a single-threaded loop from a 1 GB BAM file with 25M reads. Indeed, it seems that the performance difference in that scenario is not too big. Using 4 read threads I saw very little change:
On a larger machine, upping the number of read threads to 20, a minor penalty becomes apparent:
Of course, this is only the case when the file is cached. In practice these operations will probably be mostly disk-bound. That doesn't mean the Arc has no overhead; we're just limited by the reader.
Having seen these results, I still don't change my preference. I think the header has no useful place in the Record, and sacrificing the …
@pmarks @veldsla I don't think 25M reads represents the kind of workflow we deal with. For methylation data we regularly work with 1G reads, and we expect data to get bigger, not smaller. A couple of years back we looked at both a young rust-htslib and a more mature bioD/sambamba, and went with bioD/sambamba because it was slightly more performant and that mattered, even though D came with tooling headaches. I think we should not sacrifice performance for ergonomics and abstraction.
I deliberately used a small file to make sure IO was less of an issue, but it is already clear that we are doing IO, and that basically boils down to waiting. So unless you create your `bam::Record`s fresh from some other source, the limiting factor will be the reading/block decompression. We can still do a synthetic benchmark on a struct with an Arc vs a struct with (or without) an Rc, and I'm sure the difference will be a lot more obvious, but in practice you probably won't observe it. But yes, I think so too:
Reviving this issue again. Personally, my vote is for 3. I would also think that 1 is a no-go, as being unable to pass `bam::Record`s between threads would limit the usefulness of the library. I'm not sure what the reasoning is for implementing …
Trying to revive again. What about adding a wrapper struct that contains a …
Another option would be to have the Arc/Send+Sync variant (or AbstractInterval?) behind a feature gate which isn't part of the default features. |
Good idea, but wouldn't that mean you can't use both …
Hi, reviving this again as well: is this still a problem nowadays?
Executing it multiple times over a BAM file generally succeeds, but occasionally it generates a core dump.
Okay, this is pretty strange, as it works flawlessly if I pass a copy of the header in via: …
The Rc issues were never fixed; somebody needs to pick one of the solutions discussed above. Cloning the header probably avoids the illegal access. You could also consider switching to noodles. It provides (async!) readers/writers for BAM and CRAM (but no unified record). It's still actively changing, but it releases on crates.io and has a very responsive maintainer.
Thanks, I see. At first I thought I wouldn't mind just passing the header along every time, as it worked well on small subsets. Update: I tried passing chunks through an iterator instead of individual records, but the problem remains.
Reviving this issue again. Personally, I would go for 3. Can we get a solution for this?
I made a draft PR (#345) trying to work around this particular soundness issue. Please let me know what you think.
Should use less memory, see rust-bio/rust-htslib#293
Thank you for the very nice library!
I'm working on a tool that passes reads between threads. Unfortunately, the tool is producing non-deterministic segmentation faults and other memory errors. I've traced some of these issues back to rust-htslib, which will crash somewhat randomly when `bam::record::Record`s are passed between threads. I am not too familiar with this library, so I am wondering if this is the expected behavior? Here is a simplified example that can reproduce the crash:
Compiling with `RUSTFLAGS="-g" cargo build` and then running with a SAM passed through stdin produces the following: … Re-running the same command will produce the crash in different parts of the program. I've attached a backtrace from one of the crashes: crash backtrace.txt