Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dorado demux unexplained unclassified read #1124

Open
GeorgescuC opened this issue Nov 8, 2024 · 11 comments
Open

Dorado demux unexplained unclassified read #1124

GeorgescuC opened this issue Nov 8, 2024 · 11 comments
Assignees
Labels
barcode Issues related to barcoding

Comments

@GeorgescuC
Copy link

Issue Report

Please describe the issue:

Hi, I was testing different configurations to demux non-ONT barcodes in order to find optimal scoring and window sizes settings, and found inconsistencies in the number of classified/unclassified reads between kit configurations, so I started looking at some examples and found cases that I cannot explain.
The reads can have the barcode I am looking for on either side but they are the same sequences so the adapter and barcode sequences are the same. The barcodes are 7 bp long with >=3 edit distance and after noticing some reads where the barcode was slightly further inside the sequence than the 175bp rear_barcode_window I used at first, I tried increasing that to 250, which made some of the previously classified reads to become unclassified. In the example read used further down, the dorado verbose log indicates that this is due to both ends confidently predicting different barcodes, but the penalties for the barcodes given are 1 for the top strand vs 10 for the bottom strand while my min_barcode_penalty_dist is only 3.

Steps to reproduce the issue:

Run the dorado demux commands provided below with the files attached.

Run environment:

  • Dorado version: 0.8.2
  • Dorado command:
    • dorado-0.8.2-linux-x64/bin/dorado demux -vv --output-dir debug_new_unclassified_r250 debug_new_unclassified_r250.sam --barcode-arrangement kit_adapter_config_r250.toml --barcode-sequences kit_barcodes.fasta --no-trim
    • dorado-0.8.2-linux-x64/bin/dorado demux -vv --output-dir debug_new_unclassified debug_new_unclassified_r250.sam --barcode-arrangement kit_adapter_config.toml --barcode-sequences kit_barcodes.fasta --no-trim
  • Operating system: Ubuntu
  • Hardware (CPUs, Memory, GPUs): AMD Ryzen 9 7950X, 64GB RAM, NVIDIA RTX4090
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): SAM
  • Source data location (on device or networked drive - NFS, etc.): on device
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): 1 already basecalled read
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue): the attached zip archive contains a 1 read SAM file, and the kit configuration files used.
    dorado_issue.zip

Logs

  • Log for the configuration that does not work as expected:
dorado-0.8.2-linux-x64/bin/dorado
[2024-11-07 17:29:08.380] [info] Running: "demux" "-vvv" "--output-dir" "debug_new_unclassified_r250" "debug_new_unclassified_r250.sam" "--barcode-arrangement" "kit_adapter_config_r250.toml" "--barcode-sequences" "kit_barcodes.fasta" "--no-trim"
[2024-11-07 17:29:08.380] [trace] AlignmentProcessingItems::initialise [ENTRY]
[2024-11-07 17:29:08.380] [trace] AlignmentProcessingItems::initialise [EXIT]
[2024-11-07 17:29:08.380] [info] num input files: 1
[2024-11-07 17:29:08.380] [debug] > barcoding threads 29, writer threads 3
[2024-11-07 17:29:08.416] [info] > starting barcode demuxing
[2024-11-07 17:29:08.416] [debug] Total reads processed: 1
[2024-11-07 17:29:08.416] [trace] pushed to pipeline: 1
[2024-11-07 17:29:08.418] [debug] > Kits to evaluate: 1
[2024-11-07 17:29:08.418] [trace] query cursor 29 target cursor 26
[2024-11-07 17:29:08.418] [trace] midstrand flank top dist 17 position 129 bc_loc 155 score 0.60465115
[2024-11-07 17:29:08.418] [trace]
CTACACGACGCTCTTCCGATCTNNNNNNNGCAATGAAGTCGCAGGGTTGG
| |*||||*| ||**| ||||*|||||||*|*||*||*||* |   ||||
C-AGACGAGG-TCAAC-GATCCCTCCCTTACCATCAAATCA-A---TTGG
[2024-11-07 17:29:08.418] [trace] query cursor 28 target cursor 27
[2024-11-07 17:29:08.418] [trace] midstrand flank bottom dist 16 position 261 bc_loc 288 score 0.627907
[2024-11-07 17:29:08.418] [trace]
CCAACCCTGCGACTTCATT-GCNNNNNNNAGATCG-GAAG-AGCGTCGTGTAG
*|*||| |||||||*| || ||||||||| ||||| |*|| | | ||**| |
GCGACC-TGCGACTCC-TTAGCATTTGAC-GATCGAGTAGTA-C-TCCCG-A-
[2024-11-07 17:29:08.418] [trace] query cursor 29 target cursor 23
[2024-11-07 17:29:08.418] [trace] top score dist 8 position 0 bc_loc 23 score 0.8139535
[2024-11-07 17:29:08.418] [trace]
CTACACGACGCTCTTCCGATCTNNNNNNNGCAATGAAGTCGCAGGGTTGG
  || |*|| ||*  |||||||||||||||||||||||||||||||||||
--AC-CAAC-CTA--CCGATCTACTACACGCAATGAAGTCGCAGGGTTGG
[2024-11-07 17:29:08.418] [trace] query cursor 28 target cursor 24
[2024-11-07 17:29:08.418] [trace] bottom score dist 19 position 29 bc_loc 53 score 0.55813956
[2024-11-07 17:29:08.418] [trace]
CCAACCCTGCGACTT-CATTGCNNNNNNNAGATCGGAAGAGCGTCGTGTAG
||||*||  || ||| ||*  |||||||| ||*|||  | | ||* |  |
CCAAACC--CG-CTTTCAC--CGCTACAC-GACCGG--G-G-GTA-T--A-
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc01
[2024-11-07 17:29:08.418] [trace] top window 1
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTACTACACGCAATGA
||*|||||||||||||||||||||
CTACCGATCTACTACACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 13
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCGTGTAGTAGATCGG-
* ||| ||**|| |**|*  ||*|||
CGCTTTCACCGC-TACAC--GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc02
[2024-11-07 17:29:08.418] [trace] top window 4
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTACTAGTA-GCAATGA
||*|||||||||||* | |||||||
CTACCGATCTACTAC-ACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 11
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCTACTAGTAGATCGG-
* ||| ||**||||| |*  ||*|||
CGCTTTCACCGCTAC-AC--GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc03
[2024-11-07 17:29:08.418] [trace] top window 4
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTAGTGTACGCAATGA
||*||||||||*|**|||||||||
CTACCGATCTACTACACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 10
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCGTACACTAGATCGG-
* ||| ||**|| |||||  ||*|||
CGCTTTCACCGC-TACAC--GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc04
[2024-11-07 17:29:08.418] [trace] top window 5
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTA-T-CACTAGCAATGA
||*|||||||| | |||  |||||||
CTACCGATCTACTACAC--GCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 12
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCTAGTGATAGATCGG-
* ||| ||**||||*  |* ||*|||
CGCTTTCACCGCTAC--AC-GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc05
[2024-11-07 17:29:08.418] [trace] top window 7
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCT-C-AGCTGTGCAATGA
||*||||||| | | |** |||||||
CTACCGATCTACTA-CAC-GCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 11
[2024-11-07 17:29:08.418] [trace]
ACTTCATTGCACAGCTG-A-GATCGG-
 |* |*|| |||*|||* | ||*|||
-CG-CTTT-CACCGCTACACGACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc06
[2024-11-07 17:29:08.418] [trace] top window 4
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTCAGT-CACGCAATGA
||*||||||| |*| ||||||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 12
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCGTGACTGAGATCGG-
* ||| ||**|| | ||** ||*|||
CGCTTTCACCGC-T-ACAC-GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc07
[2024-11-07 17:29:08.418] [trace] top window 6
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCT-CATGTATGCAATGA
||*||||||| | |**|*|||||||
CTACCGATCTAC-TACACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 11
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCATACATGAGATCGG-
* ||| ||**|| ||||*||*  |||
CGCTTTCACCGC-TACACGAC--CGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc08
[2024-11-07 17:29:08.418] [trace] top window 6
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCT-CGTATGTGCAATGA
||*||||||| | ||***|||||||
CTACCGATCTAC-TACACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 12
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGC-ACATACGAGATCGG-
* ||| ||**|| |||  |||*  |||
CGCTTTCACCGCTACA--CGAC--CGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc09
[2024-11-07 17:29:08.418] [trace] top window 6
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTGAC-ATGTGCAATGA
||*||||||| || |***|||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 12
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGC-ACATGTCAGATCGG-
* ||| ||**|| |||   | ||*|||
CGCTTTCACCGCTACA---C-GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc10
[2024-11-07 17:29:08.418] [trace] top window 6
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTGAGT-CTA-GCAATGA
||*||||||| |*| | | |||||||
CTACCGATCT-ACTAC-ACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 11
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCTAGACTCAGATCGG-
* ||| ||**||||*||   ||*|||
CGCTTTCACCGCTACAC---GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc11
[2024-11-07 17:29:08.418] [trace] top window 6
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTG-TAGATAGCAATGA
||*|||||||* ||*|* |||||||
CTACCGATCTACTACAC-GCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 10
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCTATCTACAGATCGG-
* ||| ||**|||| | || ||*|||
CGCTTTCACCGCTA-C-AC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc12
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTG-TATGACGCAATGA
||*|||||||* ||* |||||||||
CTACCGATCTACTAC-ACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 11
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCGTCATACAGATCGG-
* ||| ||**|| | |*|| ||*|||
CGCTTTCACCGC-T-ACAC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc13
[2024-11-07 17:29:08.419] [trace] top window 7
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTG-TGTAGTGCAATGA
||*|||||||* |**|* |||||||
CTACCGATCTACTACAC-GCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 10
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCACTACACAGATCGG-
* ||| ||**||  ||||| ||*|||
CGCTTTCACCGC--TACAC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc14
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTTACTAGT-GCAATGA
||*||||||| ||||** |||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 12
[2024-11-07 17:29:08.419] [trace]
ACTTCATTGCACTAG-TA-A-GATCGG-
 |* |*|| |||* | || | ||*|||
-CG-CTTT-CACC-GCTACACGACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc15
[2024-11-07 17:29:08.419] [trace] top window 4
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTG-TACACTGCAATGA
||*|||||||* ||||| |||||||
CTACCGATCTACTACAC-GCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 11
[2024-11-07 17:29:08.419] [trace]
ACTTCATTGCAGTG-TACA-GATCGG-
 |* |*|| ||**| |||| ||*|||
-CG-CTTT-CACCGCTACACGACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc16
[2024-11-07 17:29:08.419] [trace] top window 6
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTTAGTG-ATGCAATGA
||*||||||| |*|* |*|||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 11
[2024-11-07 17:29:08.419] [trace]
ACTTCATTGCATCACTA-A-GATCGG-
 |* |*|| ||*|*||| | ||*|||
-CG-CTTT-CACCGCTACACGACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc17
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTAC-AGCTGGCAATGA
||*||||||||| | |**|||||||
CTACCGATCTACTA-CACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 12
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCCAGCTGTA-GATCGG-
* ||| ||   || |||**| ||*|||
CGCTTTCA---CC-GCTACACGACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc18
[2024-11-07 17:29:08.419] [trace] top window 6
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTG-TGACTGGCAATGA
||*|||||||* | ||**|||||||
CTACCGATCTACT-ACACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 11
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCCAGTCACAGATCGG-
* ||| ||**||*|  ||| ||*|||
CGCTTTCACCGCTA--CAC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc19
[2024-11-07 17:29:08.419] [trace] top window 4
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTA-TACATGGCAATGA
||*|||||||| ||||*| ||||||
CTACCGATCTACTACACG-CAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 13
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCCATGTATAGATCGG-
* ||| ||**||*|*  |* ||*|||
CGCTTTCACCGCTAC--AC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc20
[2024-11-07 17:29:08.419] [trace] top window 4
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTAC-ATACGGCAATGA
||*||||||||| |*||| ||||||
CTACCGATCTACTACACG-CAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 13
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCCGTATGTAGATCGG-
* ||| ||**||  ||*** ||*|||
CGCTTTCACCGC--TACAC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc21
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTAC-ATGTCGCAATGA
||*||||||||| |** ||||||||
CTACCGATCTACTACA-CGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 12
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCGACATGTAGATCGG-
* ||| ||**||*|||*| |*  |||
CGCTTTCACCGCTACACG-AC--CGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc22
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTTAG-ACTCGCAATGA
||*||||||| |* ||*||||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 12
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCGAGTCTAAGATCGG-
* ||| ||**||*|  | |*||*|||
CGCTTTCACCGCTA--C-ACGACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc23
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTTATCTAC--GCAATGA
||*||||||| | ||||  |||||||
CTACCGATCT-A-CTACACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 12
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCGTAGATAAGATCGG-
* ||| ||**|| ||*|*  ||*|||
CGCTTTCACCGC-TACAC--GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc24
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTG-TCATACGCAATGA
||*|||||||* | |*|||||||||
CTACCGATCTACT-ACACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 11
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCGTATGACAGATCGG-
* ||| ||**|| ||* || ||*|||
CGCTTTCACCGC-TAC-AC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Two ends confidently predict different BCs: top bc kit_bc01, bottom bc kit_bc03
[2024-11-07 17:29:08.419] [trace] BC: unclassified
[2024-11-07 17:29:08.419] [trace] Barcode for 00016867-4590-42ec-b358-e4cf7ee81f13 is unclassified
[2024-11-07 17:29:08.529] [info] > Simplex reads basecalled: 1
[2024-11-07 17:29:08.529] [info] > 1 reads demuxed @ classifications/s: 1.000000e+01
[2024-11-07 17:29:08.529] [debug] Barcode distribution :
[2024-11-07 17:29:08.529] [debug] unclassified : 1
[2024-11-07 17:29:08.529] [debug] Classified rate 0%
[2024-11-07 17:29:08.529] [info] > finished barcode demuxing
  • Log for the configuration that works as expected:
dorado-0.8.2-linux-x64/bin/dorado
[2024-11-07 16:50:39.228] [info] Running: "demux" "-vv" "--output-dir" "debug_new_unclassified" "debug_new_unclassified_r250.sam" "--barcode-arrangement" "kit_adapter_config.toml" "--barcode-sequences" "kit_barcodes.fasta" "--no-trim"
[2024-11-07 16:50:39.228] [trace] AlignmentProcessingItems::initialise [ENTRY]
[2024-11-07 16:50:39.228] [trace] AlignmentProcessingItems::initialise [EXIT]
[2024-11-07 16:50:39.228] [info] num input files: 1
[2024-11-07 16:50:39.228] [debug] > barcoding threads 29, writer threads 3
[2024-11-07 16:50:39.259] [info] > starting barcode demuxing
[2024-11-07 16:50:39.259] [debug] Total reads processed: 1
[2024-11-07 16:50:39.259] [trace] pushed to pipeline: 1
[2024-11-07 16:50:39.261] [debug] > Kits to evaluate: 1
[2024-11-07 16:50:39.261] [trace] query cursor 29 target cursor 26
[2024-11-07 16:50:39.261] [trace] midstrand flank top dist 17 position 129 bc_loc 155 score 0.60465115
[2024-11-07 16:50:39.261] [trace]
CTACACGACGCTCTTCCGATCTNNNNNNNGCAATGAAGTCGCAGGGTTGG
| |*||||*| ||**| ||||*|||||||*|*||*||*||* |   ||||
C-AGACGAGG-TCAAC-GATCCCTCCCTTACCATCAAATCA-A---TTGG
[2024-11-07 16:50:39.261] [trace] query cursor 28 target cursor 27
[2024-11-07 16:50:39.261] [trace] midstrand flank bottom dist 16 position 261 bc_loc 288 score 0.627907
[2024-11-07 16:50:39.261] [trace]
CCAACCCTGCGACTTCATT-GCNNNNNNNAGATCG-GAAG-AGCGTCGTGTAG
*|*||| |||||||*| || ||||||||| ||||| |*|| | | ||**| |
GCGACC-TGCGACTCC-TTAGCATTTGAC-GATCGAGTAGTA-C-TCCCG-A-
[2024-11-07 16:50:39.261] [trace] query cursor 29 target cursor 23
[2024-11-07 16:50:39.261] [trace] top score dist 8 position 0 bc_loc 23 score 0.8139535
[2024-11-07 16:50:39.261] [trace]
CTACACGACGCTCTTCCGATCTNNNNNNNGCAATGAAGTCGCAGGGTTGG
  || |*|| ||*  |||||||||||||||||||||||||||||||||||
--AC-CAAC-CTA--CCGATCTACTACACGCAATGAAGTCGCAGGGTTGG
[2024-11-07 16:50:39.261] [trace] query cursor 28 target cursor 28
[2024-11-07 16:50:39.261] [trace] bottom score dist 19 position 38 bc_loc 66 score 0.55813956
[2024-11-07 16:50:39.261] [trace]
CCAAC--CCTGCGACTTCATTGCNNNNNNNAGATCG--GAAG-AGCGTCGTGTAG
|||*|  |||* ||*||*||| ||||||||| |||*  |||* ||*|*|* |||
CCATCGTCCTA-GAATTAATT-CCCCTAAAA-ATCTTTGAAATAGGGCCC-GTA-
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc01
[2024-11-07 16:50:39.261] [trace] top window 1
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTACTACACGCAATGA
||*|||||||||||||||||||||
CTACCGATCTACTACACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 13
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCGTGTAGTAGATCGG-
 |*||*|||*|** ||**| |||**
GAATTAATTCCCC-TAAAA-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc02
[2024-11-07 16:50:39.261] [trace] top window 4
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTACTAGTA-GCAATGA
||*|||||||||||* | |||||||
CTACCGATCTACTAC-ACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 12
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCTACTAGTAGATCGG-
 |*||*|||*|* |||**| |||**
GAATTAATTCCC-CTAAAA-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc03
[2024-11-07 16:50:39.261] [trace] top window 4
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTAGTGTACGCAATGA
||*||||||||*|**|||||||||
CTACCGATCTACTACACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 13
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCG-TACACTAGATCGG-
 |*||*|||*|* ||*|  | |||**
GAATTAATTCCCCTAAA--A-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc04
[2024-11-07 16:50:39.261] [trace] top window 5
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTA-T-CACTAGCAATGA
||*|||||||| | |||  |||||||
CTACCGATCTACTACAC--GCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 13
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCTAGTGATAGATCGG-
 |*||*|||*|** |*|*| |||**
GAATTAATTCCCC-TAAAA-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc05
[2024-11-07 16:50:39.261] [trace] top window 7
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCT-C-AGCTGTGCAATGA
||*||||||| | | |** |||||||
CTACCGATCTACTA-CAC-GCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 13
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCACAGCTGAGA-TCGG-
 |*||*|||*| |  ||*|*| ||**
GAATTAATTCC-C--CTAAAAATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc06
[2024-11-07 16:50:39.261] [trace] top window 4
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTCAGT-CACGCAATGA
||*||||||| |*| ||||||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 14
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCGTGACTGAGA-TCGG-
 |*||*|||*|*   ||*|*| ||**
GAATTAATTCCC---CTAAAAATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc07
[2024-11-07 16:50:39.261] [trace] top window 6
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCT-CATGTATGCAATGA
||*||||||| | |**|*|||||||
CTACCGATCTAC-TACACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 13
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCA-TACATGAGATCGG-
 |*||*|||*|* ||*|  | |||**
GAATTAATTCCCCTAAA--A-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc08
[2024-11-07 16:50:39.261] [trace] top window 6
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCT-CGTATGTGCAATGA
||*||||||| | ||***|||||||
CTACCGATCTAC-TACACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 12
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCACATACGAGATCGG-
 |*||*|||*|*| ||**| |||**
GAATTAATTCCCC-TAAAA-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc09
[2024-11-07 16:50:39.261] [trace] top window 6
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTGAC-ATGTGCAATGA
||*||||||| || |***|||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 13
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCACATGTCAGATCGG-
 |*||*|||*|*| |***| |||**
GAATTAATTCCCC-TAAAA-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc10
[2024-11-07 16:50:39.261] [trace] top window 6
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTGAGT-CTA-GCAATGA
||*||||||| |*| | | |||||||
CTACCGATCT-ACTAC-ACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 14
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCTAGACTCAGA-TCGG-
 |*||*|||*|*   ||*|*| ||**
GAATTAATTCCC---CTAAAAATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc11
[2024-11-07 16:50:39.261] [trace] top window 6
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTG-TAGATAGCAATGA
||*|||||||* ||*|* |||||||
CTACCGATCTACTACAC-GCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 12
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCTATCTACAGATCGG-
 |*||*|||*|*  |||*|*|||**
GAATTAATTCCC--CTAAAAATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc12
[2024-11-07 16:50:39.261] [trace] top window 5
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTG-TATGACGCAATGA
||*|||||||* ||* |||||||||
CTACCGATCTACTAC-ACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCGTCATACAGATCGG-
 |*||*|||*|* | ||*|*|||**
GAATTAATTCCC-C-TAAAAATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc13
[2024-11-07 16:50:39.262] [trace] top window 7
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTG-TGTAGTGCAATGA
||*|||||||* |**|* |||||||
CTACCGATCTACTACAC-GCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 11
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCACTACACAGATCGG-
 |*||*|||*|*|||*| | |||**
GAATTAATTCCCCTAAA-A-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc14
[2024-11-07 16:50:39.262] [trace] top window 5
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTTACTAGT-GCAATGA
||*||||||| ||||** |||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 11
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCACTAGTAAGATCGG-
 |*||*|||*|*|||* || |||**
GAATTAATTCCCCTAA-AA-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc15
[2024-11-07 16:50:39.262] [trace] top window 4
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTG-TACACTGCAATGA
||*|||||||* ||||| |||||||
CTACCGATCTACTACAC-GCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 13
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCAGTGTACAGATCGG-
 |*||*|||*|**|**| | |||**
GAATTAATTCCCCTAAA-A-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc16
[2024-11-07 16:50:39.262] [trace] top window 6
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTTAGTG-ATGCAATGA
||*||||||| |*|* |*|||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCATCACTAAGA-TCGG-
 |*||*|||*|  | ||||*| ||**
GAATTAATTCC--C-CTAAAAATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc17
[2024-11-07 16:50:39.262] [trace] top window 5
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTAC-AGCTGGCAATGA
||*||||||||| | |**|||||||
CTACCGATCTACTA-CACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCCAGCTGTAGATCGG-
 |*||*|||*||  ||**|*|||**
GAATTAATTCCC--CTAAAAATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc18
[2024-11-07 16:50:39.262] [trace] top window 6
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTG-TGACTGGCAATGA
||*|||||||* | ||**|||||||
CTACCGATCTACT-ACACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCCAGTCACAGATCGG-
 |*||*|||*||* |*|*| |||**
GAATTAATTCCCC-TAAAA-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc19
[2024-11-07 16:50:39.262] [trace] top window 4
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTA-TACATGGCAATGA
||*|||||||| ||||*| ||||||
CTACCGATCTACTACACG-CAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCCATGTATAGATCGG-
 |*||*|||*||*|**| | |||**
GAATTAATTCCCCTAAA-A-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc20
[2024-11-07 16:50:39.262] [trace] top window 4
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTAC-ATACGGCAATGA
||*||||||||| |*||| ||||||
CTACCGATCTACTACACG-CAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCCGTATGTAGATCGG-
 |*||*|||*||*||** | |||**
GAATTAATTCCCCTAAA-A-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc21
[2024-11-07 16:50:39.262] [trace] top window 5
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTAC-ATGTCGCAATGA
||*||||||||| |** ||||||||
CTACCGATCTACTACA-CGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 13
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCGACATGTAGATCGG-
 |*||*|||*|* | |**|*|||**
GAATTAATTCCC-C-TAAAAATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc22
[2024-11-07 16:50:39.262] [trace] top window 5
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTTAG-ACTCGCAATGA
||*||||||| |* ||*||||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 13
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCGAGTCTAAGA-TCGG-
 |*||*|||*|*   ||||*| ||**
GAATTAATTCCC---CTAAAAATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc23
[2024-11-07 16:50:39.262] [trace] top window 5
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTTATCTAC--GCAATGA
||*||||||| | ||||  |||||||
CTACCGATCT-A-CTACACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCG-TAGATAAGATCGG-
 |*||*|||*|* || | || |||**
GAATTAATTCCCCTA-A-AA-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc24
[2024-11-07 16:50:39.262] [trace] top window 5
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTG-TCATACGCAATGA
||*|||||||* | |*|||||||||
CTACCGATCTACT-ACACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 13
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCG-TATGACAGATCGG-
 |*||*|||*|* ||* | | |||**
GAATTAATTCCCCTAA-A-A-ATCTTT
[2024-11-07 16:50:39.262] [trace] Scores: kit_bc01 1, kit_bc02 4, kit_bc03 4, kit_bc06 4, kit_bc20 4, kit_bc19 4, kit_bc15 4, kit_bc24 5, kit_bc23 5, kit_bc22 5, kit_bc21 5, kit_bc17 5, kit_bc14 5, kit_bc12 5, kit_bc04 5, kit_bc16 6, kit_bc11 6, kit_bc18 6, kit_bc10 6, kit_bc09 6, kit_bc08 6, kit_bc07 6, kit_bc13 7, kit_bc05 7,
[2024-11-07 16:50:39.262] [trace] BC: kit_barcode01
[2024-11-07 16:50:39.262] [trace] Barcode for 00016867-4590-42ec-b358-e4cf7ee81f13 is kit_barcode01
[2024-11-07 16:50:39.372] [info] > Simplex reads basecalled: 1
[2024-11-07 16:50:39.372] [info] > 1 reads demuxed @ classifications/s: 1.000000e+01
[2024-11-07 16:50:39.372] [debug] Barcode distribution :
[2024-11-07 16:50:39.372] [debug] kit_barcode01 : 1
[2024-11-07 16:50:39.372] [debug] Classified rate 100%
[2024-11-07 16:50:39.372] [info] > finished barcode demuxing
@malton-ont
Copy link
Collaborator

Hi @GeorgescuC,

I can see from the code that we look for the best top and bottom barcode results, which in the first case should be barcodes 1 (top penalty=1) and 3 (bottom penalty = 10). But then we compare the best overall penalties of those two barcodes - this is still the top penalty for barcode 1, but it's also the top penalty (=4) for barcode 3. So our best bottom results is actually pretty confidently predicting barcode3, but for the top rather than the bottom. What I'm not understanding is why this then fails the check, as you say your max_barcode_penalty is 3.

This all works out ok in the passing case as the best bottom penalty is 13 in both barcode 1 and barcode 3, so our std::min_element call finds the first value which is barcode 1 - the same as the best top penalty and everything continues fine.

The general issue is that the bottom barcodes are generating such terrible scores - I think there's a tacit assumption that the best bottom penalty barcode will have a better bottom penalty than it does a top penalty, but that's not the case here. Can I check that you've reverse-complemented the sequences for mask2_front and mask2_rear in the arrangement file? See the docs here.

If that's not the issue, then something looks odd. Are you able to provide us with your barcode arrangement, sequences and the sample read?

@malton-ont malton-ont added the barcode Issues related to barcoding label Nov 11, 2024
@GeorgescuC
Copy link
Author

Hi @malton-ont ,

Thank you for the quick reply. I have attached an archive with the kit description (toml and fasta files) and example read in my original post at the end of the run environment section, but here is a recap of the masks used:

mask1_front = "CTACACGACGCTCTTCCGATCT"
mask1_rear = "GCAATGAAGTCGCAGGGTTGG"
mask2_front = "CTACACGACGCTCTTCCGATCT"
mask2_rear = "GCAATGAAGTCGCAGGGTTGG"

Given that the documentation page you linked to indicates that mask2 sequences are RC when running in double-ended barcode mode and states "For double-ended barcodes which are symmetric, the flank and barcode sequences for front and rear windows are same.", I did not RC them myself. This seems like it is working as I understood it given the trace log that aligns the read to the bottom flank sequence CCAACCCTGCGACTTCATTGCNNNNNNNAGATCGGAAGAGCGTCGTGTAG, which uses RCs of mask2 flanks.

For reference, another of my tests involved running with mask2 sequences being manually RC-ed and again running in double ended barcode mode, but that reduced the classification rate from 65% to 60.5%.
I also tried demuxing using only one sided barcoding on both ends (this time using the RC for the read_only_barcodes=true version), but results also looked off (potentially due to midstrand flanks detection), so I wanted to dig into that once this current issue is resolved as there may be overlap.

For the classification scoring itself, given my understanding of how the scoring was going to work, there are multiple steps where the bottom strand barcode should not have been called as confident. Here is how I expected the algorithm to operate once the flank alignments have been set, and where an issue could be:
look at both strands:

  • top strand:
    • find barcode 01 with 1 edit distance as the best hit
    • keep barcode 01 as a good match as it is within max_barcode_penalty = 6
    • next best barcodes are at 4 edit distance, so they get discarded as they are 3 or more distance further away (min_barcode_penalty_dist)
    • return barcode 01 with 1 penalty score as the proper match
  • bottom strand:
    • find barcodes 03, 11 and 13 with 10 edit distance as the best hits
    • since there is a tie for the best hit, no confident hit can be assigned <---- potential issue here
    • return no barcode as the proper match
  • compare both returned barcodes, since only one side returned a hit, accept that one
  • final classification is barcode 01

Assuming a case where barcodes 11 and 13 were not a tie with barcode 03 on the bottom strand, I would then have assumed this logic (keeping other scores the same):
look at both strand:

  • top strand:
    • find barcode 01 with 1 edit distance as the best hit
    • keep barcode 01 as a good match as it is within max_barcode_penalty = 6
    • next best barcodes are at 4 edit distance, so they get discarded as they are 3 or more distance further away (min_barcode_penalty_dist)
    • return barcode 01 with 1 penalty score as the proper match
  • bottom strand:
    • find barcodes 03 with 10 edit distance as the best hit
    • keep barcode 03 as a low confidence match since its penalty is more than max_barcode_penalty
    • next best barcodes are at 1 edit distance, which is within min_separation_only_dist, so the best match can not be confidently called <---- potential issue here
    • return no barcode as the proper match
  • compare both returned barcodes, since only one side returned a hit, accept that one
  • final classification is barcode 01

I would also further have assumed that since the documentation states For double ended barcodes, the best window (either from the front or rear of the read) is chosen based on the alignment of the flanking mask sequences., if the front barcode is within max_barcode_penalty but the bottom barcode is not, any bottom barcode result would be ignored. This however may be more debatable because it would be hard to find a case where a barcode with more than max_barcode_penalty can still be further away from the next best hit by more than min_separation_only_dist.

Best,
Christophe.

@malton-ont
Copy link
Collaborator

Hi @GeorgescuC,

The read you have provided classifies correctly as barcode01 with that arrangement - I'd be interested in the one you showed above that rejected based on barcode03 having a top edit distance of 4...?

That read also doesn't appear to have a rear barcode at all - are you certain the barcodes are on both ends? Dorado is expecting to look for a sequence like:

5' --- ADAPTER/PRIMER --- MASK1_FRONT --- BARCODE_1 --- MASK1_REAR --- READ --- RC(MASK2_REAR) --- RC(BARCODE_2) --- RC(MASK2_FRONT ) --- 3'

(and the reverse strand)

If this is correct, I suspect that the adapter trimming during basecalling has been over-zealous - you should try rebasecalling with the --no-trim option before demuxing. Another option would be to perform classification during basecalling (by providing the custom kit info to the basecaller command) and then demuxing with --no-classify instead of providing the kit info to the demux command - this should ensure that adapter trimming is performing after the barcodes have been identified. See the barcoding and read trimming documentation.

Your description of the algorithm is not quite correct. For double ended barcodes, dorado:

  • assigns a top and bottom penalty (edit distance) to a result for each barcode
  • finds the result with the best (lowest) top penalty and the result with the best bottom penalty
  • checks if these results have:
    • both confidently predicted a barcode and **
    • have a similar penalty and, if so
    • checks they have predicted the same barcode
  • finds the best two barcodes based on the best penalty of each
  • checks that the best result is close to the end of the read and
    • is good enough and sufficiently better than the second best, or
    • is much better than the second best result
  • finally, if --barcode-both-ends has been set, checks that both the top and bottom penalty of the best result are both good enough

** this is the suspicious part - it doesn't require that the confident prediction is the top or bottom of the respective result so if the best bottom result is bad but has a highly confident (but not the best) top penalty we may still declassify it.

@GeorgescuC
Copy link
Author

Hi @malton-ont ,

The 2 configurations that I provided the run logs for use the same masks and so both cases have barcode03 with a top edit distance of 4, the only difference is the rear_barcode_window size which is 175bp in the case that classifies it properly, and 250bp in the case that does not classify it. In the later case, a very slightly better region is found to align the bottom barcode, but it is still low quality.

Regarding the read structure, the barcode can be present on either end of the read, but will only be present on one end. The molecules here are double stranded cDNA and the barcode being demuxed comes from the original RNA capture (comparable to an UMI in its location, although it is not one). This is a sub-barcode, so the ONT adapter/barcode were added to the cDNA for sequencing and then used for basecalling + demuxing. I then took the uBAM of the ONT barcode for this library and ran demuxing with the provided kit configs.

Basically, these reads can have either of these 2 structures (ignoring the ONT adapters and barcodes trimmed in the first demux round):
5' --- MASK1_FRONT --- BARCODE_1 --- MASK1_REAR --- READ --- 3'
or
5' --- READ --- RC(MASK2_REAR) --- RC(BARCODE_2) --- RC(MASK2_FRONT) --- 3'

I understand that using this config ideally works best when the barcode is on both sides, but given the size of the masks and flanks used, I thought the penalty scores of the side that does not have the barcode should always be so bad that it is never considered as a good enough match, given min_separation_only_dist = 5 while the barcodes are only 3 edit distance away.

I also thought a better approach might be to first look for front barcodes only, then for rear barcodes only among the unclassified reads, as this would also help with reorienting the reads where the barcode was found at the rear. However the tests I ran did not give me confidence in the results.
I tried a few things to check for high-level consistency and false positive calls:

  • front only barcode search with forward masks:
    • 35.478127% classified rate
  • rear only barcode search with forward masks (which should not find anything since for the rear the masks should be RCed):
    • if I leave front_barcode_window = 175, 35.478127% classified rate, identical to above because it still aligns at the front, so it seems like it is completely ignoring rear_only_barcodes = true
    • if I set front_barcode_window = 0, I get a segfault after a couple reads, and those few reads are unclassified because midstrand barcode flanks with score 1
  • rear only barcode search with RC masks:
    • 49.103554% classified rate, which seems too high in comparison with the results looking for front barcodes given that they are likely present in a similar ratio
  • barcode on both sides with both forward masks (case discussed in this issue):
    • 65.043594% classified rate, which is not equal to the sum of classifying either side independently
  • barcode on both sides with front forward masks and rear RCed masks:
    • 60.5823% classified rate, here again the rear masks are incorrect so it should not classify so many more read than just looking for a front barcode alone.

If you think it would be easier or more helpful, I can privately share the whole uBAM used in these tests (1,224,219 reads for slightly below 1GB file size)

Small note on the verbose debugging, it would be nice if the log indicated what read is currently being aligned/scored for all reads and at the beginning of that read's log.

Best,
Christophe

@malton-ont
Copy link
Collaborator

Hi @GeorgescuC,

Basically, these reads can have either of these 2 structures (ignoring the ONT adapters and barcodes trimmed in the first demux round):
5' --- MASK1_FRONT --- BARCODE_1 --- MASK1_REAR --- READ --- 3'
or
5' --- READ --- RC(MASK2_REAR) --- RC(BARCODE_2) --- RC(MASK2_FRONT) --- 3'

Are these both possible template strands? Given your mask1 and mask2 are identical, if the second strand here is meant to represent a complement strand then this should be treated as a single ended barcode - i.e. you should remove mask2 from your config. By providing both sets of masks you are telling dorado that it expects a barcode at both ends, not at either end. If this is genuinely a prep where the barcode can ligate to either end of the template strand, dorado does not support this as a standard configuration.

I'm surprised that running single-ended on the front mask and then single ended with rear_only_barcodes doesn't provide an improvement (that was going to be my suggestion) - this should only search rear_barcode_window bases at the end of the read so if your sequence is longer than that it should never see any front barcodes.

So I would expect these configs to work:

fwd_config
mask1_front = MASK1_FRONT
mask1_rear = MASK2_REAR

Gets you your initial 35%. Then, on the remaining unclassified reads:

rear_config
mask1_front = RC(MASK1_FRONT)
mask2_front = RC(MASK2_REAR)
rear_only_barcodes = true

It sounds like your main issue with this method is that you are getting a higher proportion of reads with the barcode at the rear? I'm not sure I can comment on how likely that is based on your sample prep.
If you raise an issue with our support team (referencing this issue and my name), they can provide you with a link to upload your data and I can take a closer look.

@GeorgescuC
Copy link
Author

Hi @malton-ont ,

Yes I removed mask2 in the instances where searching for only on one side. I also extended the barcodes to have both their normal version and RC version in some of the tests, and that works as expected. The configs I used are:

5'-search forward-masks:

mask1_front = MASK1_FRONT
mask1_rear = MASK1_REAR

barcode1_pattern = BARCODES and RC(BARCODES)

3'-search forward-masks (expecting to not find anything):

mask1_front = MASK1_FRONT
mask1_rear = MASK1_REAR
rear_only_barcodes = true

barcode1_pattern = BARCODES and RC(BARCODES)

3'-search rc-masks:

mask1_front = RC(MASK1_REAR)
mask1_rear = RC(MASK1_FRONT)
rear_only_barcodes = true

barcode1_pattern = RC(BARCODES)

For all these tests, I ran using the full uBAM of the ONT barcode so classification rates are comparable.

It sounds like your main issue with this method is that you are getting a higher proportion of reads with the barcode at the rear?

There seem to be a few issues that may be compounding so I originally made this issue about a single one of them to try to figure out things step by step, but then mentioned the others as well:

  • using a double sided barcode search, in at least one case where a good barcode match is found (1 penalty) at the 5' front, using different window sizes for the 3' rear search where both "best matches" are bad (10 and 11 penalty at best) by more than max_barcode_penalty and with ties between barcodes yields different results: one case properly classifies based on the front barcode found, one case leaves the read unclassified because "ambiguous". Given in both cases the best 3' rear match is just as bad, I expect the read to be classified in both cases to the 5' front match, or to at least be consistent and have the same final result.
  • using rear_only_barcodes = true with a non-zero front_barcode_window and the same forward-masks as the front search gives the same result as doing the default front barcode search, so it seems like rear_only_barcode is ignored.
  • using read_only_barcodes = true with a zero front_barcode_window and the same forward-masks segfaults after a few reads
  • using read_only_barcodes = true with a zero front_barcode_window and the rc-masks runs fine, but the classification rate seems unexpectedly high given the experimental design. I cannot rule out yet that the results are true, but given the 3 issues above, I have not dedicated time to digging into it.

I will contact your support team then. I think it will much easier to figure out what is going wrong (either an error on my side or a bug) with all files in hand. I will include all the configs I used and a short list of commands I ran to quickly reproduce.

Best,
Christophe.

@malton-ont malton-ont self-assigned this Nov 18, 2024
@GeorgescuC
Copy link
Author

Hi @malton-ont ,

I have contacted your support team, who provided me with a link to share the data. I have now uploaded the full ubam, kit configuration files, and list of commands to reproduce the issue there.

Best,
Christophe.

@malton-ont
Copy link
Collaborator

@GeorgescuC,

I've received the files, thanks. From a preliminary investigation, it appears that your classification rates of:

  • FF5p - 35.4%
  • RC3p - 48.8%
  • Remainder unclassified - 15.7%

are correct - performing the RC3p classification on the remaining unclassified reads from the FF5p gives a classification rate of 75.7% which, when applied to the 64.6% unclassified fraction, gives 48.9%. This matches the RC3p run on the full dataset.

I believe the reason you are seeing an identical classification rate between your FF5p and FF3p configurations is because you have incorrectly specified rear_barcode_only instead of rear_only_barcodes - fixing this leads to a classified rate of 0.5%, which seems reasonable. (This is the same in the version that segfaults).

I've determined that we can make a change in dorado such that your FF5pRC3p arrangement would classify all of these reads in a single pass with a classification rate of 84.2% (matching the individual runs). I also verified that running with this change and adding --barcode-both-ends gives a classification rate of 0.002%, indicating that almost no reads have barcodes on both ends and confirming the above analysis.

We'll aim to get this change out in a future release.

@GeorgescuC
Copy link
Author

Hi @malton-ont ,

For the 3p versions I see it now, there was no log mentioning it so I did not notice (since other errors usually led to a segfault). It also explains why making the front window size 0 made that version segfault.

Being able to classify both with a single pass would be nice. Would it be possible to mention in a tag in the BAM which side the barcode was found on? That would allow easily reorienting the reads if desired. If not the two-pass single end will still be useful.

It looks like the changes you made are in the master branch, are these all? If so I will try to build from it (unless you have a build you can share) and see if specific example reads that I was looking at classify as expected now.

Thank you for your help,
Christophe.

@malton-ont
Copy link
Collaborator

malton-ont commented Nov 26, 2024

Hi @GeorgescuC,

Unused entries in a .toml file are not strictly an error, so we couldn't detect the incorrect keyword I'm afraid. We can tighten the validation on the windows though to prevent invalid values.

We've had other requests to extend the information available regarding the barcode classification. I'll raise this internally.

Yes, those changes are the ones I used to generate the numbers above. You should be able to build from that branch.

One other thought occurs to me. It looks like you may have done a previous round of demuxing to handle a second set of barcodes? If so, you may want to try doing that demux with the --no-trim option and then expanding your search windows for the second demux. This might improve the classification rate of the front barcodes, since this is substantially lower than for the rear barcodes.

@GeorgescuC
Copy link
Author

Hi @malton-ont ,

I will try to build from that branch then and look through this data again and let you know what I find.

I did also actually rerun basecalling with the --no-trim option on the full run over the weekend to rule out the trimming as the source for some differences on the ends of reads (different barcode), so I can compare how both versions perform in terms of second barcode classification. Before this run I used to run basecalling without trimming with longer windows for barcodes, so I never ran that comparison.

Best,
Christophe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
barcode Issues related to barcoding
Projects
None yet
Development

No branches or pull requests

2 participants