
Passing bam::record::Record between threads causes a segfault #293

Open
DonFreed opened this issue Jan 22, 2021 · 28 comments

Comments

DonFreed (Author)

Thank you for the very nice library!

I'm working on a tool that passes reads between threads. Unfortunately, the tool is producing non-deterministic segmentation faults and other memory errors. I've traced some of these issues back to rust-htslib, which will crash somewhat randomly when bam::record::Records are passed between threads. I am not too familiar with this library, so I am wondering if this is the expected behavior?

Here is a simplified example that can reproduce the crash:

use std::error::Error;
use std::str;
use std::sync::mpsc::{self, Receiver};
use std::thread;

use rust_htslib::bam::{Read, Reader};
use rust_htslib::bam::record::Record;

fn sum_mapqs(rx: Receiver<Option<Record>>) -> Result<(), Box<dyn Error>> {
    let mut total_mapq = 0u64;
    loop {
        match rx.recv() {
            Ok(x) => {
                match x {
                    Some(read) => {
                        let mapq = read.mapq();
                        total_mapq = total_mapq.saturating_add(mapq as u64);
                    },
                    None => {  // No more data
                        println!("Total MapQ: {}", total_mapq);
                        return Ok(());
                    },
                }
            },
            Err(e) => {
                eprintln!("Error reciving data: {}", e);
            }
        }
    }
}

fn main() -> Result<(), Box<dyn Error>> {
    let mut bam = match Reader::from_stdin() {
        Ok(bam) => { bam },
        Err(e) => { return Err(Box::new(e)); },
    };

    // Initialize the writer thread
    let (tx, rx) = mpsc::channel();
    let writer = thread::spawn(move || {
        if let Err(e) = sum_mapqs(rx) {
            eprintln!("Error writing output - {}", e);
        }
    });

    for read in bam.records() {
        match read {
            Ok(read) => {
                eprintln!("Parsed read: {}", str::from_utf8(read.qname()).unwrap());
                if let Err(e) = tx.send(Some(read)) {
                    eprintln!("Error sending data to writer thread");
                    return Err(Box::new(e));
                }
            },
            Err(e) => {
                return Err(Box::new(e));
            },
        }
    }

    // Close the spawned thread
    let _ = tx.send(None);
    let _ = writer.join().unwrap();
    Ok(())
}

Compiling with RUSTFLAGS="-g" cargo build and then running with a SAM file passed through stdin produces the following:

$ cat test.sam | target/debug/rust-htslib-crash
...
Parsed read: H203:185:D2990ACXX:4:1101:11561:5493
23559 Broken pipe             cat test.sam
23560 Segmentation fault      (core dumped) | target/debug/rust-htslib-crash

Re-running the same command will produce the crash in different parts of the program. I've attached a backtrace from one of the crashes: crash backtrace.txt

sitag (Contributor) commented Jan 24, 2021

@DonFreed Do you still get the error if you manually allocate a record and fill it in while iterating over the BAM, then copy it to pass it to the channel?
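
For concreteness, here is a minimal sketch of that suggestion, reusing the bam reader and the tx channel from the example above; it assumes the newer Read::read signature returning Option<Result<(), Error>> (older rust-htslib versions use a different return type):

let mut record = Record::new();
while let Some(result) = bam.read(&mut record) {
    result?; // propagate htslib read errors
    // Copy the freshly filled record before handing it to the channel,
    // then keep reusing the same allocation for the next read.
    tx.send(Some(record.clone())).expect("receiver hung up");
}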

DonFreed (Author)

@sitag Thanks for the response. Manually allocating multiple records and moving them to a second thread produces the same type of error. Manually allocating a single record, passing this record back and forth between threads and eventually dropping the record in the original thread seems to work, which is probably an acceptable workaround for my use-case.

Here's a gist with modifications to the earlier example code that tests both cases, https://gist.github.com/DonFreed/c42725df1e3f81cfbd7e527016d34d6d.

sitag (Contributor) commented Jan 25, 2021

@DonFreed I suspect that it has to do with drop - there is a call to free in there. Good to know something is working for you, though passing the same record back and forth may mean you are blocking on that record being available, so you are not getting the full parallelism you want. You could just use rayon, turn the iterator into a parallel iterator, and not worry about channels.
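
For reference, a rough sketch of that rayon route using par_bridge (this still relies on Record and Reader being Send, which is exactly what this issue calls into question):

use rayon::iter::{ParallelBridge, ParallelIterator};
use rust_htslib::bam::{self, Read};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut bam = bam::Reader::from_stdin()?;
    let total_mapq: u64 = bam
        .records()
        .par_bridge() // fan the sequential record iterator out over the rayon thread pool
        .map(|r| r.map(|rec| rec.mapq() as u64).unwrap_or(0))
        .sum();
    println!("Total MapQ: {}", total_mapq);
    Ok(())
}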

jch-13 (Contributor) commented Jan 25, 2021

bam::record::Record holds an Rc (a non-thread-safe smart pointer) to the corresponding BAM header. When I replace Rc with its thread-safe alternative Arc in rust-htslib's src/bam/{record,mod}.rs, the segfault disappears.

sitag (Contributor) commented Jan 25, 2021

@jch-13 bam::record::Record should not be marked Sync then.

jch-13 (Contributor) commented Jan 25, 2021

@jch-13 bam::record::Record should not be marked Sync then.

Send and Sync are implemented manually for Records here.
Simply replacing Rc with Arc would likely come with a performance penalty. So I'm not sure how to move on from here... Do we need a smart pointer there or would even a simple reference do?
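
For readers following along, the manual implementations in question amount to roughly this (simplified; see src/bam/record.rs for the actual code):

unsafe impl Send for Record {}
unsafe impl Sync for Record {}

The unsafe keyword here is a promise to the compiler that Record is safe to move (Send) and share (Sync) across threads, which the non-atomic Rc inside it cannot guarantee.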

veldsla (Contributor) commented Jan 25, 2021

There's always the option to remove the header reference. I've always found it a bit weird, given the structure of a SAM/BAM/CRAM file, that every record should contain a reference to the header.

On the other hand, I usually do my parallelization over chromosomes/contigs using the IndexedReader. This avoids the burden of sending records across channels. But this might not be viable for every genome...
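
A rough sketch of that per-contig pattern (hedged: the path is a placeholder and fetch() accepts different argument forms across rust-htslib versions): each rayon worker opens its own IndexedReader, so no Record ever crosses a thread boundary.

use rayon::prelude::*;
use rust_htslib::bam::{IndexedReader, Read};

fn reads_per_contig(path: &str) -> Result<Vec<(String, u64)>, Box<dyn std::error::Error>> {
    let reader = IndexedReader::from_path(path)?;
    let names: Vec<String> = reader
        .header()
        .target_names()
        .iter()
        .map(|n| String::from_utf8_lossy(n).into_owned())
        .collect();
    let counts = names
        .into_par_iter()
        .map(|name| {
            // Each worker gets its own reader, so records never leave their thread.
            let mut bam = IndexedReader::from_path(path).expect("open indexed BAM");
            bam.fetch(name.as_str()).expect("fetch contig");
            let n = bam.records().filter(|r| r.is_ok()).count() as u64;
            (name, n)
        })
        .collect();
    Ok(counts)
}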

sitag (Contributor) commented Jan 25, 2021

@jch-13 Unless I am misunderstanding something, since Record uses non-thread-safe types, the Send and Sync traits should not be implemented. That correctly conveys that users of the API are responsible for ensuring thread safety, and we can avoid this situation.

jch-13 (Contributor) commented Jan 25, 2021

Yes, you're absolutely right. However, being able to pass Records around between threads is quite a useful feature and it was there before the Rc was added in 6c537a0 / 3fa020d.

veldsla (Contributor) commented Feb 19, 2021

I think it's a bit painful to see a segfaulting issue go stale, so let me try to revive this. It seems clear that the current situation is invalid. I see three possible solutions also mentioned before:

  1. Remove the Send/Sync implementations and make bam::Record not shareable.
  2. Replace the Rc<HeaderView> with Arc<HeaderView> and always pay the performance penalty.
  3. Remove Rc<HeaderView> from Record. Breaking change, but the Rc was introduced fairly recently.

Anyone see another way out of this? Do we take a vote? I'm on 1 or 3, but mostly 3.

sitag (Contributor) commented Feb 20, 2021

@veldsla I agree. As it stands, the Send/Sync impls are not valid. My primary vote is for 3 as well, then 1. There may be performance benefits from having a smaller bam::Record struct. HeaderView may be needed to get the reference sequence name from a tid, so we will need to document how to do that (cf. #288).
Option 2 is a no-go.
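
For the record, that lookup goes through the reader's HeaderView rather than the Record itself; a minimal sketch (the path "sample.bam" is just a placeholder):

use rust_htslib::bam::{self, Read};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bam = bam::Reader::from_path("sample.bam")?;
    let header = bam.header(); // &bam::HeaderView, owned by the reader
    let tid = 0u32; // in practice this comes from Record::tid(), which is -1 for unmapped reads
    let name = std::str::from_utf8(header.tid2name(tid))?;
    println!("tid {} -> {}", tid, name);
    Ok(())
}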

veldsla (Contributor) commented Feb 22, 2021

I looked at the implications of removal. Mostly straightforward, except that it breaks the bio_types::genome::AbstractInterval implementation that is present for bam::Record.

The implementation is not really that well suited anyway. contig() panics when there is no header or on invalid UTF-8. It also does UTF-8 validation for every call to contig(), which makes it unsuitable for calling in a loop. range() panics when .cigar_cached() wasn't called on the record, which is just bad form. (I would also still like to remove this cigar caching, but let's postpone that discussion.)

I have a branch with the changes, so a PR is trivial.

@johanneskoester, @pmarks, any thoughts?

pmarks (Contributor) commented Feb 22, 2021

@veldsla I wonder if we could collect more data on 2? My naive expectation is that it would be a negligible performance difference that would be difficult to observe. Might be interesting to do a benchmark before we rule out that option on the grounds of performance, but I do agree this would go against the "slow features should be opt-in" principle.

For example if decrementing the Arc is very fast compared to the rest of the Record::drop() impl, then it seems like just using an Arc makes the Record somewhat more usable.

I'm also OK with 3 if that's what people prefer. 1 is a definite no-go.

veldsla (Contributor) commented Feb 23, 2021

@pmarks I did some tests using a simple program that reads every record in a single-threaded loop from a 1 GB BAM file containing 25M reads. Indeed, it seems that the performance difference in that scenario is not too big. With 4 read threads I saw very little change:

Benchmark #1: ./htslibrctest
  Time (mean ± σ):      4.436 s ±  0.035 s    [User: 20.640 s, System: 0.630 s]
  Range (min … max):    4.392 s …  4.513 s    10 runs
 
Benchmark #2: ./htslibarctest
  Time (mean ± σ):      4.435 s ±  0.030 s    [User: 20.614 s, System: 0.651 s]
  Range (min … max):    4.379 s …  4.477 s    10 runs
 
Summary
  './htslibarctest' ran
    1.00 ± 0.01 times faster than './htslibrctest'

On a larger machine, after upping the number of read threads to 20, a minor penalty becomes apparent:

Benchmark #1: ./htslibrctest
  Time (mean ± σ):      2.664 s ±  0.051 s    [User: 24.248 s, System: 1.336 s]
  Range (min … max):    2.578 s …  2.721 s    10 runs
 
Benchmark #2: ./htslibarctest
  Time (mean ± σ):      2.829 s ±  0.078 s    [User: 24.122 s, System: 1.295 s]
  Range (min … max):    2.737 s …  2.981 s    10 runs
 
Summary
  './htslibrctest' ran
    1.06 ± 0.04 times faster than './htslibarctest'

Of course this is only the case when the file is cached. In practice these operations will probably be mostly disk-bound. That doesn't mean the Arc has no overhead, but we're just limited by the reader.

veldsla (Contributor) commented Feb 23, 2021

Having seen these results, I still haven't changed my preference. I think the header has no useful place in the Record, and sacrificing the AbstractInterval impl would be a good thing. Also, when doing my naive s/Rc/Arc/ I saw 17 occurrences of unsafe impl (Send|Sync) for structs that often contain an Rc. So a decent unsafe audit is in order.

sitag (Contributor) commented Feb 26, 2021

@pmarks @veldsla I don't think 25M reads represents the kind of workflow we deal with. For methylation data we regularly work with 1G reads; we expect data to get bigger, not smaller. A couple of years back we looked at both a young rust-htslib and a more mature bioD/sambamba, and went with bioD/sambamba because it was slightly more performant and that mattered, even though D came with tooling headaches. I think we should not sacrifice performance for ergonomics and abstraction.

veldsla (Contributor) commented Mar 1, 2021

I deliberately used a small file to make sure IO was less of an issue, but it is already clear that we are doing IO, and that basically boils down to waiting. So unless you create your bam::Records from some other source, the limiting factor will be the reading/block decompression. We can still do a synthetic benchmark on a struct with an Arc vs. a struct with (or without) an Rc, and I'm sure the difference will be a lot more obvious, but in practice you probably won't observe it.
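
A synthetic comparison of that kind could look like the following sketch: it only measures clone/drop of an Rc vs. an Arc around a dummy payload standing in for the header, so the absolute numbers say nothing about real BAM reading, where IO and decompression dominate.

use std::hint::black_box;
use std::rc::Rc;
use std::sync::Arc;
use std::time::Instant;

fn main() {
    let payload = vec![0u8; 1024];
    let rc = Rc::new(payload.clone());
    let arc = Arc::new(payload);

    let t = Instant::now();
    for _ in 0..10_000_000 {
        black_box(rc.clone()); // non-atomic refcount increment + drop
    }
    println!("Rc:  {:?}", t.elapsed());

    let t = Instant::now();
    for _ in 0..10_000_000 {
        black_box(arc.clone()); // atomic refcount increment + drop
    }
    println!("Arc: {:?}", t.elapsed());
}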

But yes, I think so too:

I think we should not sacrifice performance for ergonomics and abstraction

DonFreed (Author) commented Mar 26, 2021

Reviving this issue again.

Personally, my vote is for 3. I would also think that 1 is a no-go as being unable to pass bam::Records between threads would limit the usefulness of the library. I'm not sure what the reasoning is for implementing bio_types::genome::AbstractInterval for bam::Records, so it's hard for me to weigh the merits of 2 vs. 3.

jch-13 (Contributor) commented Jun 4, 2021

Trying to revive again. What about adding a wrapper struct that contains a {bam,bcf}::Record and an Arc pointing to its header and only implementing AbstractInterval for that? So you'd pay only for (atomic) reference counting when you actually do need the header reference? So basically option 3 with a workaround for those who need AbstractInterval impls on records. It's not exactly elegant, though...
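
A rough sketch of what such a wrapper could look like (the name RecordWithHeader and its methods are hypothetical, not an existing rust-htslib API):

use std::sync::Arc;
use rust_htslib::bam::{HeaderView, Record};

pub struct RecordWithHeader {
    record: Record,
    header: Arc<HeaderView>,
}

impl RecordWithHeader {
    pub fn new(record: Record, header: Arc<HeaderView>) -> Self {
        Self { record, header }
    }

    // An AbstractInterval impl would be built on accessors like this, leaving
    // the plain Record free of any header pointer. Assumes a mapped record
    // (tid() returns -1 for unmapped reads).
    pub fn contig(&self) -> &str {
        std::str::from_utf8(self.header.tid2name(self.record.tid() as u32))
            .expect("reference name is not valid UTF-8")
    }
}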

tedil (Member) commented Jun 4, 2021

Another option would be to have the Arc/Send+Sync variant (or AbstractInterval?) behind a feature gate which isn't part of the default features.

jch-13 (Contributor) commented Jun 5, 2021

Good idea, but wouldn't that mean you can't use both AbstractInterval and Sync simultaneously?

ebioman commented Nov 15, 2021

Hi, reviving this again as well: is this still a problem nowadays?
I tried to pass records around with rayon and I think I am running into a similar issue:

// Assumed context: `bam` is a rust_htslib::bam::Reader, `unbounded` comes from
// the crossbeam_channel crate, and `use std::str;` is in scope.
let thread_results: Vec<String>;
let (snd, rxv) = unbounded();
rayon::scope(|child| {
    for record in bam.records() {
        let snd_local = snd.clone();
        child.spawn(move |_| {
            let read = record.expect("ERROR: could not extract read information from record!");
            let tmp_result = str::from_utf8(read.qname()).unwrap().to_string();
            snd_local.send(tmp_result).expect("ERROR: thread could not communicate result!");
        });
    }
});
drop(snd);
thread_results = rxv.iter().collect();
dbg!(thread_results);

Executing it multiple times over a BAM file generally succeeds, but occasionally it generates a core dump.

ebioman commented Nov 15, 2021

Okay, this is pretty strange, as it works flawlessly if I pass a copy of the header in as well:

rayon::scope(|child| {
    for record in bam.records() {
        let snd_local = snd.clone();
        let _header_local = header.clone();
        child.spawn(move |_| {
            let read = record.expect("ERROR: could not extract read information from record!");
            let tmp_result: String = str::from_utf8(read.qname()).unwrap().to_string();
            snd_local.send(tmp_result).expect("ERROR: thread could not communicate result!");
        });
    }
});
drop(snd);
thread_results = rxv.iter().collect();
dbg!(thread_results);



veldsla (Contributor) commented Nov 15, 2021

The Rc issues were never fixed. Somebody needs to pick a solution discussed above. The cloning of the header probably avoids the illegal access.

You could also consider switching to noodles. It provides (async!) reader/writers for BAM and CRAM (but no unified record). It's still actively changing, but releases on crates.io and has a very responsive maintainer.

ebioman commented Nov 15, 2021

Thanks, I see. At first I thought I would not mind just passing the header along every time, as it worked well on small subsets.
It turns out, though, that this makes my RAM explode very quickly on large BAM files :/
And obviously, if I just reference the header view instead of cloning it, it never gets through the entire file but eventually hits that bug.

Update: I tried passing through a chunk iterator instead of the records, but the problem remains.
I need tid and mtid, and therefore unfortunately the header. It seems that whenever I try to do that, I either hit the occasional segmentation fault, or, if I clone the header, I exhaust RAM :(

Ebedthan commented Feb 1, 2022

Reviving this issue again.

Personally, I will go for 3. Can we have a solution for this?

jch-13 (Contributor) commented Feb 2, 2022

I made a draft PR (#345) trying to work around this particular soundness issue. Please let me know what you think.

sc13-bioinf added a commit to roland-rad-lab/CRISPR_rust that referenced this issue Mar 8, 2022
brentp (Contributor) commented Dec 19, 2022

What's the consensus here? It seems that Arc (option 2) requires the least code change, and @veldsla has shown that the performance hit is tolerable.
Is there some concurrency setup where this is not an issue? @sitag recommends using a rayon parallel iterator. Is there an example here?
