tiny-count: new edit pattern modes for the Mismatches selector #337

AlexTate · 2024-05-31T18:22:17Z

Rules that define a mismatch requirement can be extended to require a specific edit pattern. Two choices are available for this parameter:

ADAR: all mismatches must follow the A → I edit pattern that is characteristic of the double-stranded RNA-specific adenosine deaminase (ADAR) enzyme family. Inosine is recognized as guanosine by reverse transcriptase and is therefore read as G when sequenced, so this pattern appears as A → G in sequencing data.
TUT: all mismatches must follow the N → U edit pattern at the 3' terminus which is characteristic of the Terminal Uridylyl Transferase (TUT) enzyme family. Mismatches must be consecutive from the 3' end. Reverse transcription prior to sequencing means this pattern is represented as N → T in sequencing data.
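
As a rough illustration of the two patterns, the sketch below walks an MD string alongside the read sequence and checks each mismatch. It is not the code from this PR: the function names are hypothetical, the read is assumed to align to the forward strand, and the alignment is assumed to be ungapped (no insertions or deletions).

```python
import re

# One MD token is a run of matches, a deletion (^ACGT), or a mismatched reference base
MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")

def mismatches(seq, md):
    """Return (read_pos, ref_base, read_base) for each mismatch in an ungapped alignment."""
    found, pos = [], 0
    for run, deletion, ref_base in MD_TOKEN.findall(md):
        if run:
            pos += int(run)          # skip over a run of matching bases
        elif deletion:
            raise ValueError("deletions are outside the scope of this sketch")
        else:
            found.append((pos, ref_base.upper(), seq[pos].upper()))
            pos += 1
    return found

def is_adar_pattern(seq, md):
    """True if every mismatch is a reference A that was read as G."""
    return all(ref == "A" and read == "G" for _, ref, read in mismatches(seq, md))

def is_tut_pattern(seq, md):
    """True if all mismatches were read as T and form an unbroken run ending at the 3' end."""
    mm = mismatches(seq, md)
    expected = range(len(seq) - len(mm), len(seq))
    return all(pos == exp and read == "T" for (pos, _, read), exp in zip(mm, expected))
```

For example, `is_adar_pattern("ACGTGG", "2A1A0A0")` and `is_tut_pattern("ACGTTT", "4A0C0")` both hold, while a T mismatch in the middle of the read fails the TUT check. Both functions accept an alignment with no mismatches; the number of mismatches is a separate requirement.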

This option applies globally to all rules except those that lack a mismatch requirement. Rules without the requirement will continue to allow any number of mismatches following any edit pattern.

This PR contains the following additional changes:

Mismatches, for now, can no longer be calculated from the CIGAR string. All alignments must therefore report an NM tag, and if the tag is missing, it is treated as an error. The prior CIGAR method failed to disambiguate the M match operator.
~~Soft clipped bases are now honored and reflected in the sequence (and therefore 5' NT), start, and length.~~(Edit 6/10: after further discussion, there is more work to be done in reconciling inconsistencies across selectors when using this approach. This might be revisited in a future issue.)
Selectors that use the NumericalMatch class, including the Length and Mismatches selectors, are more forgiving when parsing and validating definitions. Leading and trailing whitespace is now permissible, as well as variable whitespace around commas and hyphens. Definitions composed entirely of whitespace will continue to fail validation.
Unmapped alignments are no longer skipped by recursively calling _next_alignment() in the AlignmentIter class. Python's max recursion depth is actually quite shallow (1000), so a SAM/BAM file with 1000+ unmapped alignments would have produced a RecursionError.
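
The recursion problem in the last item can be reproduced with a small stand-in. This is a minimal sketch, not the AlignmentIter implementation: records are plain dictionaries with a hypothetical `is_unmapped` key, and both function and class names are made up for illustration.

```python
import sys

def next_mapped_recursive(records):
    """Skips unmapped records by recursing; one stack frame per consecutive unmapped record."""
    record = next(records)
    if record["is_unmapped"]:
        return next_mapped_recursive(records)
    return record

class MappedOnly:
    """Iterator wrapper that skips unmapped records with a loop instead of recursion."""

    def __init__(self, records):
        self._records = iter(records)

    def __iter__(self):
        return self

    def __next__(self):
        # Stack depth stays constant regardless of how many unmapped records occur in a row
        for record in self._records:
            if not record["is_unmapped"]:
                return record
        raise StopIteration

# Enough consecutive unmapped records to exceed the interpreter's recursion limit
n_unmapped = sys.getrecursionlimit() + 100
records = [{"is_unmapped": True}] * n_unmapped + [{"is_unmapped": False, "name": "read_1"}]

mapped = list(MappedOnly(records))
```

Calling `next_mapped_recursive(iter(records))` on the same input raises `RecursionError`, while the loop-based wrapper yields the single mapped record.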

…etailed_mismatch is True. Other major changes to AlignmentIter:
- Soft clipping is now honored in the Seq, Start, and End alignment dictionary keys.
- The constructor arg has_nm has been changed to expected_tags, which is just a tuple of the tags found in the first alignment. At construction this is checked for the NM tag, and if the new constructor arg detailed_mismatch is True, the MD tag is also checked. It's an error if these tags are missing but expected.
- CIGAR parsing for edit distance is no longer supported until the method can be rewritten. The current implementation does not properly disambiguate M operations. For this reason the NM tag is now (temporarily) required.
- Unmapped alignments are no longer skipped by recursively calling _next_alignment(). Python's max recursion depth is actually quite shallow, so a SAM/BAM file with 1000+ unmapped alignments would produce a RecursionError.

… command line argument, and the new arguments for AlignmentIter. If the NM tag is missing from the first alignment, this is now treated as an error (see previous commit). If the MD tag is missing and a mismatch pattern was specified, this is treated as an error.
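
The tag check described in these commits might look roughly like the following. This is a sketch under assumptions: only the names expected_tags and detailed_mismatch come from the commit messages, while the function name, the exception type, and the message wording are hypothetical.

```python
def check_expected_tags(expected_tags, detailed_mismatch=False):
    """Raise if NM is absent, or if MD is absent while an edit pattern was requested.

    expected_tags: tuple of tag names found in the first alignment, e.g. ("NH", "NM", "MD")
    """
    required = ["NM"] + (["MD"] if detailed_mismatch else [])
    missing = [tag for tag in required if tag not in expected_tags]
    if missing:
        raise ValueError(f"Alignments are missing required tag(s): {', '.join(missing)}")
```

With Pysam, the tuple could be built from the tag names returned by an alignment's `get_tags()` method.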

- EditMatch: now used for the Mismatches selector rather than the generic NumericalMatch. This was necessary because we now have to pass the entire alignment dictionary into Mismatch selectors rather than just the NM value.
- AdarEditMatch: evaluates the A->G mismatch pattern.
- TutEditMatch: evaluates the N->U 3' terminal uridylation mismatch pattern.

…or more than just length selection for a while now... Whitespace is now removed from the selector definition prior to validating and parsing it. This relaxes some formatting requirements for the definition, e.g., a range with spaces around the hyphen is now acceptable.

…ent now defaults to False. Since we are temporarily removing mismatch calculation from the CIGAR string, missing NM tags are now treated as an error. Previous commits on this branch only checked the first alignment for the NM tag.

…delimiter-surrounding whitespace from definitions before parsing. This allows more flexibility for definitions like "1 - 2" or "3 " which would have produced a potentially confusing error. Cleaning up the regex that validates NumericalMatch definitions (mainly removing the ambiguous back reference). If validation fails, the original definition is shown to the user rather than the whitespace-cleaned definition.
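
A minimal sketch of this kind of cleanup is shown below. It is not the NumericalMatch code: the regex, function name, and error message are assumptions, and the definition syntax is assumed to be comma-separated integers and hyphenated ranges.

```python
import re

# Comma-separated integers and ranges, e.g. "1,3-5,8"; no back references
VALID = re.compile(r"\d+(-\d+)?(,\d+(-\d+)?)*")

def parse_definition(definition):
    """Return the set of integers described by a definition such as " 1 - 2, 5 "."""
    # Drop leading/trailing whitespace and any whitespace around commas and hyphens
    cleaned = re.sub(r"\s*([,-])\s*", r"\1", definition.strip())
    if not VALID.fullmatch(cleaned):
        # Report the definition exactly as the user wrote it, not the cleaned version
        raise ValueError(f'Invalid definition: "{definition}"')
    values = set()
    for part in cleaned.split(","):
        low, _, high = part.partition("-")
        values.update(range(int(low), int(high or low) + 1))
    return values
```

Here `parse_definition("1 - 2")` and `parse_definition("3 ")` both succeed, while a definition made up entirely of whitespace, or one with whitespace between digits such as `"1 2"`, still fails validation.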

…, skip, and pad). If found, the exception includes the offending record number and the basename of the file. In this commit I'm also moving static methods out of AlignmentIter and into standalone cdef functions, for cleanup and better testability.

AlexTate · 2024-06-10T19:07:15Z

@taimontgomery
Additional changes:

Alignments are rejected if they report incompatible CIGAR operators (soft clip, hard clip, skip, and pad). If found, the exception includes the offending record number and the file's basename
Diagnostic alignment tables now include the MD string in the Mismatches column when a mismatch pattern is specified
Minor optimization in AdarEditMatch to skip parsing the "0" character when it is used as a delimiter/flank of mismatch operations (i.e., when it doesn't represent a run of matches)
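
The CIGAR check in the first item could be sketched as follows. The function name, exception type, and message format are hypothetical; only the set of rejected operators and the inclusion of the record number and file basename come from the description above.

```python
import os
import re

# Operators this sketch rejects, keyed by their SAM CIGAR character
INCOMPATIBLE_OPS = {"S": "soft clip", "H": "hard clip", "N": "skip", "P": "pad"}

def check_cigar(cigar, record_number, path):
    """Raise if the CIGAR string contains a soft clip, hard clip, skip, or pad operation."""
    for op in re.findall(r"\d+([MIDNSHP=X])", cigar):
        if op in INCOMPATIBLE_OPS:
            raise ValueError(
                f"Unsupported CIGAR operator '{op}' ({INCOMPATIBLE_OPS[op]}) "
                f"at record {record_number} of {os.path.basename(path)}"
            )
```

A call such as `check_cigar("2S20M", 7, "/data/sample.sam")` raises an error that names record 7 and `sample.sam`, while `"22M"` passes silently.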

…ery sequence handling to the former approach. The query sequence, its start position, and its length are once again determined from the entire read sequence rather than from the aligned portion in the case of base clipping. Functionally, this commit has no effect; alignments that report clipped bases are rejected before reaching this point. However, these calls to Pysam's API carry significantly less overhead.

AlexTate added 13 commits May 28, 2024 20:46

Adding the new *EditMatch classes to FeatureSelector.build_selectors(…

105b6ab

…). Stage 2 alignment now passes the entire alignment dictionary to *EditMatch classes so that edit pattern can be evaluated using the SEQ field and MD tag.

Adding the command line parameter to tiny-count, Run Configs, and CWL…

6cd3d5b

… for specifying edit pattern.

Adding unit tests for AdarEditMatch and TutEditMatch

3e1f12b

Minor style change. Makes more sense for the default case to be next …

26a8fc6

…to the lookup method

Adding a dummy class that inherits from two argparse formatters that …

d2905f0

…are required for the tiny-count help string.

Adding missing @sq headers to identity_choice_test.sam to appease Pys…

20658ac

…am since this otherwise isn't a target for tests. Adding more test cases to the NumericalMatch validation test. Finally, updating an unrelated decollapse test that had slipped through the cracks.

Updating documentation to reflect the new mismatch pattern parameter,…

0e9e7f7

… and the temporary removal of mismatch calculation from the cigar string

AlexTate requested a review from taimontgomery May 31, 2024 18:22

AlexTate added 5 commits June 10, 2024 12:19

Minor consistency change to have AlignmentReader throw the same excep…

45843cd

…tion type as AlignmentIter when checking the first alignment for missing NM tag

Minor optimization in AdarEditMatch. The character "0" is very abunda…

e72d5ae

…nt in MD strings. If it isn't preceded by another digit, skip the unnecessary work of parsing it into an int and incrementing the read position by 0.

Corrected handling of the line_num cdef attribute from Python space

5da6bd8

The MD string is now reported in the Mismatches column of diagnostic …

d409744

…alignment tables when a mismatch pattern is specified.

AlexTate added 3 commits July 8, 2024 13:34

Minor fix, renaming a variable whose name was misleading

e03a29a

Adding a sanity check for the unlikely case that an alignment is miss…

f5f099d

…ing a CIGAR string
