Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] rust-htslib integration #11

Closed
wants to merge 12 commits into from
Closed

Conversation

eharr
Copy link
Contributor

@eharr eharr commented Jan 10, 2023

@jguhlin I think it would useful to have some functionality to write the mappings obtained to a BAM file. Unfortunately right now there seems to be a conflict between the dependencies for rust-htslib and minimap2-rs (probably flate2). To narrow down the conflict I created this branch with some unfinished htslib conversion code behind a htslib feature flag and I also moved the the map_file code behind a map-file feature flag. Cargo tests pass with either of these flags switched on, but when I enable both (cargo test --features map-file,htslib) I get the error listed below.

This PR prompts several questions:

  1. Do you have any ideas on how to fix the error below? It looks like there are multiple versions of zlib available to the linker. It could potentially impact any downstream crate that has both minimap2-rs and rust-htslib as dependencies.
  2. Would you be okay with putting the map_file code behind a feature flag? It's certainly useful but with the additional dependencies it might make more sense as an opt-in
  3. Do you think that the htslib conversion code in the PR should live in the minimap2-rs crate or elsewhere?

Best regards,
Eoghan

Compiling minimap2-sys v0.1.7
Compiling minimap2 v0.1.9 (/mmfs1/uslinuxhome/eharrington/p/public/minimap2-rs)
...
...
error: linking with cc failed: exit status: 1
|
= note: "cc" "-m64" ...
= note: /.../minimap2-rs/target/debug/deps/liblibz_ng_sys-e0644fd88f36dad5.rlib(deflate.c.o): In function 'read_buf':
/.../.cargo/registry/src/github.com-1ecc6299db9ec823/libz-ng-sys-1.1.8/src/zlib-ng/deflate.c:1073: multiple definition of 'read_buf'
/.../p/public/minimap2-rs/target/debug/deps/liblibz_sys-b8ad998ce79bc021.rlib(deflate.c.o):/mmfs1/uslinuxhome/eharrington/.cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.8/src/zlib-ng/deflate.c:1073: first defined here
/.../p/public/minimap2-rs/target/debug/deps/liblibz_ng_sys-e0644fd88f36dad5.rlib(deflate.c.o): In function 'deflateStateCheck':
/.../.cargo/registry/src/github.com-1ecc6299db9ec823/libz-ng-sys-1.1.8/src/zlib-ng/deflate.c:327: multiple definition of 'fill_window'
collect2: error: ld returned 1 exit status
error: could not compile minimap2 due to previous error

eharr added 2 commits January 10, 2023 15:32
WIP to convert Mapping to a bam record, can be enabled with the `htslib`
feature flag. In order to get it to compile on linux I needed to move
the flate2-dependent code behind a `map-file` feature flag.
@jguhlin
Copy link
Owner

jguhlin commented Jan 10, 2023

I'm not opposed to splitting functionality behind features, or even going to a 3-crate system (like ztd has -sys, -safe, and -rs). I think the map-file should be a default-feature for now, but we can cross that bridge a bit later, I'll think on it a bit.

For your direction questions:

  1. No, this is my first FFI project, and I'm glad it hasn't collapsed yet! I'll dig into it though, it's definitely interfering with htslib's libz and here. Maybe the answer is to drop flate itself and switch to libz? I'll work on it a bit.

My real preference would be to only provide buffers and functions to handle buffers, but that makes it harder for people to casually use the library.

  1. Yep (as above).

  2. I think it's good. If it gets too large it can always be split off later.

Unrelated:
Looking at cargo tree maybe we should drop bytelines, or re-implement it, it's more than half the dependency tree. But it should be giving some speed.

@jguhlin
Copy link
Owner

jguhlin commented Jan 10, 2023

I've got it compiled but can't run the test, as it's data you haven't committed yet. Can you give it a try and let me know how it goes? You may need to run cargo update.

It's also switched over to zlib instead of zlib-ng, but someone could use zlib-ng-compat feature of flate2 and that should replace zlib with zlib-ng.

@eharr
Copy link
Contributor Author

eharr commented Jan 11, 2023

Thanks for the quick response! It passed the tests, I checked in the test file in case you need it.

@eharr
Copy link
Contributor Author

eharr commented Jan 11, 2023

Regarding the overall package scope I think you're right that operating strictly on buffers might be a challenge for some (me included!).

There's a pattern that I've seen in rust-htslib where they use an inner stuct member to access the underlying buffer and then build accessors over it (https://github.com/rust-bio/rust-htslib/blob/81e4b851499960f2f03693ae84d055912adfbc58/src/bam/record.rs#L52). They also provide an ext.rs module with higher-level functions. This pattern might also have some bearing on threading (rust-bio/rust-htslib#293). This is just an observation - not a suggestion :).

@jguhlin
Copy link
Owner

jguhlin commented Jan 11, 2023

Great. I can merge this in when you feel it is ready. Could you add some docs to the functions as well? I'm trying to get better with that as well.

Is htslib the best name for it, or maybe under utils or something similar? Just thinking out loud here..

@eharr
Copy link
Contributor Author

eharr commented Jan 11, 2023

Great - I'll work on finishing it up tomorrow and let you know. I think htslib works as there might come a point where someone might want similar functionality for the noodles crate.

@jguhlin
Copy link
Owner

jguhlin commented Jan 11, 2023

Sounds good. Let me know if I can help. I'm having a bit of a dizzy spell today so not working much, but happy to help out later this week if needed. Cheers. :)

eharr added 3 commits January 11, 2023 08:15
Added a synthetic dataset to test some of the different aligment types: forward, reverse, primary, secondary, supplementary, unmapped, spliced
@eharr
Copy link
Contributor Author

eharr commented Jan 12, 2023

I just pushed a commit with a more expansive test dataset that should hopefully make it easier to spot differences in output between the cli minimap sam file and the one we're generating. I created synthetic reads that generate spliced, primary, secondary, supplementary aligments when mapped by minimap2-cli.

I hit a bug (I think) when comparing the strands of a read that should map in the reverse orientation (perfect_read.rev in the test set). The result of reg.rev() is 0 suggesting that it's mapping to the forward strand, but it really should be the reverse. Would you mind taking a look to ensure that I haven't missed something obvious?

Also would you like me to add a PAF file to the test dataset? It should be relatively straight forward.

@jguhlin
Copy link
Owner

jguhlin commented Jan 17, 2023

Hey, thanks for the extra tests! A PAF would be great.

Digging into the rev() function now.

@jguhlin
Copy link
Owner

jguhlin commented Jan 17, 2023

I added a test for detecting reverse complement strands, and it seems to work. Can you write up a test for the place you are having the problem?

You can add this to the test_mappings test

        // This should be reverse strand
        let mappings = aligner.map("TTTTGCATCGCTGAAAACCCCAAAGTATATTTTAGAACTCGTCTATAGGTTCTACGATTTAACATCCACAGCCTTCTGGTGTCGCTGGTGTTTCAAACACCTCGATATATCACTCCTTCTGAATAACATCCATGAAAGAAGAGCCCAATCCATACTACTAAAGCTATCGTCATATGCACCATGGTCTTTTGAGAAAATTTTGCCCTCTTTAATTGACTCTAAGCTAAAAAAGAAAATTTTAATCAGTCCTCAAATTACTTACGTAGTCTTCAAATCAATAAACTATATGATAACCACGAATGACGATAAAATACACAAGTCCGCTATTCCTTCTTCTTCCTCTCTACCGT".as_bytes(), false, false, None, None).unwrap();
        println!("Reverse Strand\n{:#?}", mappings);
        assert!(mappings[0].strand == Strand::Reverse);      

@eharr
Copy link
Contributor Author

eharr commented Jan 18, 2023

Thanks - I'll add that test in. Having spent a bit of time looking at the internals of minimap it seems like the best way to get a properly formatted SAM record is to use the mm_write_sam3 function and then use htslib to parse the resulting string. This is the approach rust-bwa takes. I'll try to push a cleaned up version of this branch as soon as I can.

@jguhlin
Copy link
Owner

jguhlin commented Jan 18, 2023

Yeah, this is for sure a very light wrapper over minimap2. I wonder if I can rewrite the worker_pipeline function to remove the dependence on kthreads/pthreads (so it could then compile on windows/wasm) but that's a different thread.

I've got mm2-fast as a separate backend on the main branch, will be curious to see if that works with your changes.

@jguhlin jguhlin closed this Jan 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants