Dorado demux unexplained unclassified read #1124
Comments
Hi @GeorgescuC, I can see from the code that we look for the best top and bottom barcode results, which in the first case should be barcodes 1 (top penalty = 1) and 3 (bottom penalty = 10). But then we compare the best overall penalties of those two barcodes - this is still the top penalty for barcode 1, but it's also the top penalty (= 4) for barcode 3. So our best bottom result is actually pretty confidently predicting barcode 3, but for the top rather than the bottom. What I'm not understanding is why this then fails the check, as you say your This all works out ok in the passing case as the best bottom penalty is 13 in both barcode 1 and barcode 3, so our The general issue is that the bottom barcodes are generating such terrible scores - I think there's a tacit assumption that the best bottom penalty barcode will have a better bottom penalty than it does a top penalty, but that's not the case here. Can I check that you've reverse-complemented the sequences for If that's not the issue, then something looks odd. Are you able to provide us with your barcode arrangement, sequences and the sample read?
Hi @malton-ont , Thank you for the quick reply. I have attached an archive with the kit description (toml and fasta files) and example read in my original post at the end of the run environment section, but here is a recap of the masks used:
Given that the documentation page you linked to indicates that mask2 sequences are RC when running in double-ended barcode mode and states "For double-ended barcodes which are symmetric, the flank and barcode sequences for front and rear windows are same.", I did not RC them myself. This seems like it is working as I understood it given the trace log that aligns the read to the bottom flank sequence For reference, another of my tests involved running with mask2 sequences being manually RC-ed and again running in double ended barcode mode, but that reduced the classification rate from 65% to 60.5%. For the classification scoring itself, given my understanding of how the scoring was going to work, there are multiple steps where the bottom strand barcode should not have been called as confident. Here is how I expected the algorithm to operate once the flank alignments have been set, and where an issue could be:
Assuming a case where barcodes 11 and 13 were not a tie with barcode 03 on the bottom strand, I would then have assumed this logic (keeping other scores the same):
I would also further have assumed that since the documentation states Best, |
Hi @GeorgescuC, The read you have provided classifies correctly as That read also doesn't appear to have a rear barcode at all - are you certain the barcodes are on both ends? Dorado is expecting to look for a sequence like:
(and the reverse strand) If this is correct, I suspect that the adapter trimming during basecalling has been over-zealous - you should try rebasecalling with the Your description of the algorithm is not quite correct. For double ended barcodes, dorado:
** this is the suspicious part - it doesn't require that the confident prediction is the top or bottom of the respective result so if the best bottom result is bad but has a highly confident (but not the best) top penalty we may still declassify it. |
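To make the suspicious step concrete, here is a minimal Python sketch of the behaviour as described in this thread. It is not dorado's actual code: the class, the function, the `confidence_threshold` value of 5 and the `strict` switch are all hypothetical, and the bottom penalty of 13 for barcode 1 is an assumed placeholder. Only the penalties 1, 4 and 10 come from the failing read discussed above.

```python
from dataclasses import dataclass


@dataclass
class BarcodeScore:
    name: str
    top_penalty: int
    bottom_penalty: int

    @property
    def best_penalty(self) -> int:
        # Best penalty for this barcode, regardless of which window produced it.
        return min(self.top_penalty, self.bottom_penalty)


def classify_double_ended(scores, confidence_threshold, strict=False):
    """Toy double-ended check (hypothetical, not dorado's implementation)."""
    best_top = min(scores, key=lambda s: s.top_penalty)
    best_bottom = min(scores, key=lambda s: s.bottom_penalty)
    if best_top.name == best_bottom.name:
        return best_top.name
    if strict:
        # Judge each candidate only on the window that selected it.
        top_ok = best_top.top_penalty <= confidence_threshold
        bottom_ok = best_bottom.bottom_penalty <= confidence_threshold
    else:
        # Behaviour described above: judge each candidate on its best overall penalty.
        top_ok = best_top.best_penalty <= confidence_threshold
        bottom_ok = best_bottom.best_penalty <= confidence_threshold
    if top_ok and bottom_ok:
        return "unclassified"  # both ends "confidently" disagree
    if top_ok:
        return best_top.name
    if bottom_ok:
        return best_bottom.name
    return "unclassified"


scores = [
    BarcodeScore("barcode01", top_penalty=1, bottom_penalty=13),  # 13 is assumed
    BarcodeScore("barcode03", top_penalty=4, bottom_penalty=10),
]
print(classify_double_ended(scores, confidence_threshold=5))
print(classify_double_ended(scores, confidence_threshold=5, strict=True))
```

Under these assumed numbers the non-strict path rejects the read, because barcode 3 is chosen for its bottom penalty but judged on its top penalty, while the strict path keeps barcode 1.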
Hi @malton-ont, The 2 configurations that I provided the run logs for use the same masks and so both cases have Regarding the read structure, the barcode can be present on either end of the read, but will only be present on one end. The molecules here are double-stranded cDNA and the barcode being demuxed comes from the original RNA capture (comparable to a UMI in its location, although it is not one). This is a sub-barcode, so the ONT adapter/barcode were added to the cDNA for sequencing and then used for basecalling + demuxing. I then took the uBAM of the ONT barcode for this library and ran demuxing with the provided kit configs. Basically, these reads can have either of these 2 structures (ignoring the ONT adapters and barcodes trimmed in the first demux round): I understand that using this config ideally works best when the barcode is on both sides, but given the size of the masks and flanks used, I thought the penalty scores of the side that does not have the barcode should always be so bad that it is never considered as a good enough match, given I also thought a better approach might be to first look for front barcodes only, then for rear barcodes only among the unclassified reads, as this would also help with reorienting the reads where the barcode was found at the rear. However, the tests I ran did not give me confidence in the results.
If you think it would be easier or more helpful, I can privately share the whole uBAM used in these tests (1,224,219 reads for slightly below 1GB file size) Small note on the verbose debugging, it would be nice if the log indicated what read is currently being aligned/scored for all reads and at the beginning of that read's log. Best, |
Hi @GeorgescuC,
Are these both possible template strands? Given your mask1 and mask2 are identical, if the second strand here is meant to represent a complement strand then this should be treated as a single ended barcode - i.e. you should remove mask2 from your config. By providing both sets of masks you are telling dorado that it expects a barcode at both ends, not at either end. If this is genuinely a prep where the barcode can ligate to either end of the template strand, dorado does not support this as a standard configuration. I'm surprised that running single-ended on the front mask and then single ended with So I would expect these configs to work:
Gets you your initial 35%. Then, on the remaining unclassified reads:
It sounds like your main issue with this method is that you are getting a higher proportion of reads with the barcode at the rear? I'm not sure I can comment on how likely that is based on your sample prep. |
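The two-pass idea above can be illustrated with a toy Python sketch. It uses exact string matching rather than dorado's alignment and penalty scoring, and the flank, barcode and insert sequences are made up for the example. A read with the barcode at the rear is treated as the reverse complement of a front-barcoded read, which is the assumption behind searching the rear with RC masks.

```python
def reverse_complement(seq: str) -> str:
    # Complement each base, then reverse the string.
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_front(seq, flank, barcodes, window):
    """Return the name of the first barcode whose flank+barcode occurs
    exactly within the first `window` bases, or None."""
    head = seq[:window]
    for name, barcode in barcodes.items():
        if flank + barcode in head:
            return name
    return None


def two_pass_demux(reads, flank, barcodes, window):
    """Pass 1: search the front. Pass 2: search the front of the reverse
    complement, i.e. the rear of the original read with RC masks."""
    calls = {}
    for read_id, seq in reads.items():
        name = find_front(seq, flank, barcodes, window)
        if name is not None:
            calls[read_id] = (name, "front")
            continue
        name = find_front(reverse_complement(seq), flank, barcodes, window)
        calls[read_id] = (name, "rear") if name is not None else (None, None)
    return calls


# Hypothetical sequences, for illustration only.
flank = "GGTACC"
barcodes = {"bc01": "ACGTACG", "bc02": "TTGCAAT"}
insert = "CATCATCATCATCATCATCATCAT"
reads = {
    "r1": flank + barcodes["bc01"] + insert,                      # barcode at the front
    "r2": reverse_complement(flank + barcodes["bc02"] + insert),  # barcode at the rear
    "r3": insert,                                                 # no barcode
}
calls = two_pass_demux(reads, flank, barcodes, window=20)
print(calls)
```

Recording the side alongside the call is what makes it possible to reorient rear-barcoded reads afterwards by reverse-complementing them.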
Hi @malton-ont, Yes, I removed mask2 when searching on only one side. I also extended the barcodes to include both their normal version and RC version in some of the tests, and that works as expected. The configs I used are: 5'-search forward-masks:
3'-search forward-masks (expecting to not find anything):
3'-search rc-masks:
For all these tests, I ran using the full uBAM of the ONT barcode so classification rates are comparable.
There seem to be a few issues that may be compounding so I originally made this issue about a single one of them to try to figure out things step by step, but then mentioned the others as well:
I will contact your support team then. I think it will be much easier to figure out what is going wrong (either an error on my side or a bug) with all files in hand. I will include all the configs I used and a short list of commands I ran to quickly reproduce the issue. Best,
Hi @malton-ont, I have contacted your support team, who provided me with a link to share the data. I have now uploaded the full uBAM, kit configuration files, and list of commands to reproduce the issue there. Best,
I've received the files, thanks. From a preliminary investigation, it appears that your classification rates of:
are correct - performing the RC3p classification on the remaining unclassified reads from the FF5p gives a classification rate of 75.7% which, when applied to the 64.6% unclassified fraction, gives 48.9%. This matches the RC3p run on the full dataset. I believe the reason you are seeing an identical classification rate between your FF5p and FF3p configurations is because you have incorrectly specified I've determined that we can make a change in dorado such that your FF5pRC3p arrangement would classify all of these reads in a single pass with a classification rate of 84.2% (matching the individual runs). I also verified that running with this change and adding We'll aim to get this change out in a future release. |
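The arithmetic behind those rates can be checked with a few lines of Python. The two input figures are the ones quoted above; the variable names are only illustrative.

```python
# Figures quoted in the thread.
unclassified_after_ff5p = 0.646  # fraction left unclassified by the FF5p pass
rc3p_rate_on_remainder = 0.757   # RC3p classification rate on those reads

# Fraction classified by the first pass.
ff5p_rate = 1.0 - unclassified_after_ff5p
# Second-pass rate expressed as a fraction of the full dataset.
rc3p_rate_on_full = rc3p_rate_on_remainder * unclassified_after_ff5p
# Overall rate after both passes.
combined_rate = ff5p_rate + rc3p_rate_on_full

print(f"FF5p: {ff5p_rate:.1%}")
print(f"RC3p on full dataset: {rc3p_rate_on_full:.1%}")
print(f"combined: {combined_rate:.1%}")
```

The product reproduces the 48.9% figure, and the combined rate agrees with the quoted 84.2% to within the rounding of the two inputs.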
Hi @malton-ont , For the 3p versions I see it now, there was no log mentioning it so I did not notice (since other errors usually led to a segfault). It also explains why making the front window size 0 made that version segfault. Being able to classify both with a single pass would be nice. Would it be possible to mention in a tag in the BAM which side the barcode was found on? That would allow easily reorienting the reads if desired. If not the two-pass single end will still be useful. It looks like the changes you made are in the master branch, are these all? If so I will try to build from it (unless you have a build you can share) and see if specific example reads that I was looking at classify as expected now. Thank you for your help, |
Hi @GeorgescuC, Unused entries in a .toml file are not strictly an error, so we couldn't detect the incorrect keyword I'm afraid. We can tighten the validation on the windows though to prevent invalid values. We've had other requests to extend the information available regarding the barcode classification. I'll raise this internally. Yes, those changes are the ones I used to generate the numbers above. You should be able to build from that branch. One other thought occurs to me. It looks like you may have done a previous round of demuxing to handle a second set of barcodes? If so, you may want to try doing that demux with the |
Hi @malton-ont , I will try to build from that branch then and look through this data again and let you know what I find. I did also actually rerun basecalling with the Best, |
Issue Report
Please describe the issue:
Hi, I was testing different configurations to demux non-ONT barcodes in order to find optimal scoring and window size settings, and found inconsistencies in the number of classified/unclassified reads between kit configurations, so I started looking at some examples and found cases that I cannot explain.
The reads can have the barcode I am looking for on either side but they are the same sequences so the adapter and barcode sequences are the same. The barcodes are 7 bp long with >=3 edit distance and after noticing some reads where the barcode was slightly further inside the sequence than the 175 bp rear_barcode_window I used at first, I tried increasing that to 250, which caused some previously classified reads to become unclassified. In the example read used further down, the dorado verbose log indicates that this is due to both ends confidently predicting different barcodes, but the penalties for the barcodes given are 1 for the top strand vs 10 for the bottom strand while my min_barcode_penalty_dist is only 3.
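As a rough illustration of the window effect described above, the following Python sketch checks whether a barcode position falls inside a rear search window. The helper, the 1500-base read length and the 190-base offset are hypothetical, and the check ignores flanks and anything else dorado may require to fit in the window.

```python
def in_rear_window(read_length, barcode_start, barcode_length, rear_window):
    """True if the barcode lies entirely within the last `rear_window`
    bases of the read (0-based coordinates)."""
    window_start = max(0, read_length - rear_window)
    barcode_end = barcode_start + barcode_length
    return barcode_start >= window_start and barcode_end <= read_length


read_length = 1500                 # hypothetical read length
barcode_start = read_length - 190  # barcode begins 190 bases from the read end
for window in (175, 250):
    inside = in_rear_window(read_length, barcode_start, 7, window)
    print(window, inside)
```

A wider window brings such a barcode into range, but it also exposes more read sequence to spurious matches, which is one way a previously classified read could end up with conflicting calls.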
Steps to reproduce the issue:
Run the dorado demux commands provided below with the files attached.
Run environment:
dorado_issue.zip
Logs