-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extended BAM support #129
Comments
A lot of the problems with tags can be solved with an implementation using the following guidelines.
So an implementation with a tag array and a value array (containing typecode and value) seems efficient as well as easily convertable to and from BAM format. Removing entries is simple: simply zero the entry. Make it so that zero entries are not written. Adding entries is also simple. Check if there are 0-entries (the tag array needs to be scanned anyway to see if the tag already exists) and overwrite them. If not make the arrays a little bigger. Replacing tags is also easy: overwrite them. This is not going to be trivial to implement, but at least it as an implementation that will work well and scale well. (Albeit using slightly more memory than the htslib implementation, which simply keeps the file format representation in memory). It is also easily possible in this format to selectively not support manipulating the tags array with every type code. What I mean is that of course dnaio should be able to correctly parse all tag typecodes as well as write them back properly, but support for adding new tags can be limited. For example the 'Z' typecode for strings will probably the most-used one by cutadapt (if not the only one) so that can be enabled. The same goes for reading typecodes. Sure it is not great that the implementation is not complete, but it allows for spreading the work over multiple releases and support as needed. Sorry, I had to write all this down, otherwise I can't stop thinking about it and start working on my other tasks :-). |
One of the most costly things in terms of performance is memory allocations. So it is a good thing to do these as little as possible. Unfortunately there are some variable length items in BAM file tags such as arrays and strings. This cannot be solved without dynamic memory allocation. However with a bit of organization this can be remedied. For the value store I originally intented the following layout:
With value being cast to a pointer to a dynamically allocated piece of memory in case of 'B' or 'Z' type tags. It is however much more efficient to allocate only a single piece of memory for all the B and Z tags. The value part of TagValue can then be cast to a Another question is whether to store tags and values separate or together. Also the following struct can be used
That saves allocating one array, at the cost of fast tag contains checking. I prefer keeping the fast check. So implementation wise the following members would need to be added to the SequenceRecord struct
That is 32 bytes extra to sequence record. In my opinion this is acceptable. Then we need to think about how to get and set tags. I think the following signatures are appropriate:
Yes this mimicks the pysam API. I think they made a pretty good design choice there. Setting multiple tags simultaneously or worrying about value_type seems to me to be beyond the scope of dnaio. No, it does not use numpy arrays. Numpy arrays are only efficient on quite large arrays as they have tremendous overhead. As well as introducing a massive dependency to the project. For arrays I am quite conflicted on whether to use memoryview (a type that when iterated over gives the correct results!) or array. On Cython using array.array is easier than when using native C. I guess a good solution for now is to not support |
See https://github.com/biopython/biopython/blob/master/Bio/bgzf.py for a possibly useful BGZF implementation available under the 3-clause BSD license. I think that'd let you vendor that file (bundle it as it within the MIT licensed dnaio), since adding a dependency on the whole of Biopython might not be tempting. |
Thanks! Since only streaming is supported some shortcuts can be taken. An io.BufferedWriter with a buffersize of 65200 can be used to write to a RawBgzipWriter that will simply take any writes, compresses them and enclose them in a BGZF header and footer. BGZIP streaming is already supported since that can already be handled by normal gzip readers. This would grant the required amount of BGZF support with very little extra code. I think dnaio can provide value over pysam by providing limited support for only a few BAM features as they are used in uBAM. That allows for a much cleaner interface whilst also being a lot faster. Full BAM and BGZF support including indexing will require lots of extra code, and is not really of value for uBAM files (as they come from the sequencer). In use cases where it is needed I think using native htslib bindings such as pysam are a better option. |
Hi all, has there been any movement on this front? We would love for cutadapt to be compatible with bam files. |
dnaio already supports reading BAM single-end files (#109). This issue is about improving the support in a couple of ways. What would you need most? |
Whoops! I thought I was on the cutadapt page. I'm looking to search for
specific adapter sequences in my aligned bam files.
…On Wed, Sep 25, 2024 at 10:00 PM Marcel Martin ***@***.***> wrote:
dnaio already [supports reading BAM single-end files)(#109
<#109>). This issue is about
improving the support in a couple of ways. What would you need most?
—
Reply to this email directly, view it on GitHub
<#129 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACPHYTYERP6IR6VIDORHTYTZYOIILAVCNFSM6AAAAABJDPY6ECVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNZVHA3TINRTGQ>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
Cutadapt delegates reading and writing sequences to dnaio, so discussing this particular aspect is ok here. That said, do you have a concrete idea of how searching for adapters in already aligned BAM files should work or is this more a general feature suggestion? I expect that any decent read mapper soft-clips adapter sequences, so I would usually not expect this to be so useful. Feel free to open an issue in the Cutadapt repository to discuss this aspect. |
@billytcl This would pose technical challenges for dnaio as paired-end reads are generally not mapped at the same position. So this prohibits streaming. This means indexed reading support needs to be added. Currently my plan is to integrate as much BAM features as needed to fully support ONT uBAM files as created by dorado. Or maybe single-end libraries for aligned BAM. Integrating more features would simply mean creating an alternative version of PySAM, which has two problems:
Trust me, I have tried. I broke off that project because getting CRAM support, support for indexed reads etc. was so much effort, that it would be better just to use htslib than to create a native python library. But that project already exists, PySAM. Then there is the issue that trimming after the alignment is just the wrong order of doing things. The adapters do affect the alignment score and the placement of the reads. So you have badly aligned reads with adapters. Just trimming them gives badly aligned reads. In your case I recommend using |
Note to self: Cutadapt also delegates trimming to dnaio, as SequenceRecord objects can be sliced. I want to properly slice tags. The dorado implementation can be found here. https://github.com/nanoporetech/dorado/blob/c3a2952356e2506ef1de73b0c9e14784ab9a974a/dorado/demux/Trimmer.cpp#L127 |
@marcelm Would dnaio support cram files coming out of an Ultima sequencer or do the files need to be converted to fastq beforehands? |
CRAM files are very complicated. https://github.com/samtools/hts-specs/blob/master/CRAMv3.pdf . As such these files are better parsed using htslib or bindings thereof. I recommend pysam for that use case. Alternatively |
As discussed in marcelm/cutadapt#757 adding BAM support to dnaio would have several advantages for cutadapt. The alternative is to defer to pysam instead, and keep dnaio at its basic level.
In order for full BAM support I think the following things need to be added:
Converting a SequenceRecord object to binary BAM bytes, is something where I see no technical difficulties. A BAM header is also not verry difficult as it is just a SAM header with a few binary quirks.
BGZF is going to be a slight hurdle as it cannot be appropriately done with xopen, so direct library calls to zlib/libdeflate/isa-l are needed. Since only streamed writing is required it is actually quite easy to write a RawWriter that creates the BGZF blocks and wrap that in an io.BufferedWriter that has a 65200 byte buffer size. That will automatically take care of the sizing and will also be pretty fast.
Tag support is really tough. There are several options here:
On top of that, how do you design the API for manipulating the tags?
Then there is also paired end support. It is possible to put some limits on what is supported. Name-sorted files come to mind. Enabling full support requires supporting indexes and that is also a lot of work. In that case pysam is a better option.
The text was updated successfully, but these errors were encountered: