tiny-count: new edit pattern modes for the Mismatches selector #337

Open
wants to merge 21 commits into
base: master
from

Conversation

AlexTate
Member

@AlexTate AlexTate commented May 31, 2024

Rules that define a mismatch requirement can be extended to require a specific edit pattern. Two choices are available for this parameter:

  • ADAR: all mismatches must follow the A → I edit pattern which is characteristic of the double-stranded RNA-specific adenosine deaminase (ADAR) enzyme family. Inosine is recognized as guanosine by reverse transcriptase and therefore represented as G when sequenced, so this pattern is represented as A → G in sequencing data.
  • TUT: all mismatches must follow the N → U edit pattern at the 3' terminus which is characteristic of the Terminal Uridylyl Transferase (TUT) enzyme family. Mismatches must be consecutive from the 3' end. Reverse transcription prior to sequencing means this pattern is represented as N → T in sequencing data.
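
As a rough illustration of how the two patterns could be evaluated from the SEQ field and MD tag, here is a minimal Python sketch. It is not the tiny-count implementation: the function names are hypothetical, it assumes forward-strand alignments without indels or clipping, and it treats an alignment with zero mismatches as trivially satisfying either pattern.

```python
import re

# One MD token: a run of matches, a deletion (^BASES), or a single mismatched reference base
MD_TOKEN = re.compile(r"(\d+)|(\^[A-Z]+)|([A-Z])")

def mismatches_from_md(seq, md):
    """Return [(read_pos, ref_base, read_base), ...] for an ungapped alignment."""
    pos, found = 0, []
    for run, deletion, ref_base in MD_TOKEN.findall(md):
        if run:
            pos += int(run)           # matched bases advance the read position
        elif deletion:
            raise ValueError("deletions are outside the scope of this sketch")
        else:
            found.append((pos, ref_base, seq[pos]))
            pos += 1
    return found

def is_adar_pattern(seq, md):
    """Every mismatch is reference A read as G (assumption: forward strand only)."""
    return all(ref == "A" and read == "G"
               for _, ref, read in mismatches_from_md(seq, md))

def is_tut_pattern(seq, md):
    """Every mismatch is read as T and the mismatches are consecutive from the 3' end."""
    mm = mismatches_from_md(seq, md)
    positions = [pos for pos, _, _ in mm]
    terminal = list(range(len(seq) - len(mm), len(seq)))
    return positions == terminal and all(read == "T" for _, _, read in mm)
```

For example, `is_adar_pattern("ACGTG", "4A0")` accepts a single A → G mismatch at the last base, while `is_tut_pattern("ACGTTT", "4A0C0")` accepts two terminal N → T mismatches. Reverse-strand alignments would need the complementary pattern, which this sketch does not attempt.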

This option applies globally to all rules except those that lack a mismatch requirement. Rules without the requirement will continue to allow any number of mismatches following any edit pattern.

This PR contains the following additional changes:

  • Mismatches, for now, can no longer be calculated from the CIGAR string. All alignments must therefore report an NM tag, and a missing tag is treated as an error. The prior CIGAR method failed to disambiguate the M operator, which covers both sequence matches and mismatches.
  • Soft clipped bases are now honored and reflected in the sequence (and therefore 5' NT), start, and length. (Edit 6/10: after further discussion, there is more work to be done in reconciling inconsistencies across selectors when using this approach. This might be revisited in a future issue.)
  • Selectors that use the NumericalMatch class, including the Length and Mismatches selectors, are more forgiving when parsing and validating definitions. Leading and trailing whitespace is now permissible, as well as variable whitespace around commas and hyphens. Definitions composed entirely of whitespace will continue to fail validation.
  • Unmapped alignments are no longer skipped by recursively calling _next_alignment() in the AlignmentIter class. Python's default recursion limit is quite low (1000), so a SAM/BAM file with 1000+ unmapped alignments would have produced a RecursionError.
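
The recursion issue in the last item can be reproduced with a small, self-contained Python sketch. The record dictionaries and both function names are hypothetical stand-ins rather than the AlignmentIter code, and the sketch assumes the recursive version recursed once per skipped record, so a long enough run of unmapped records exhausts the stack.

```python
import sys

def next_mapped_recursive(records):
    """Skip unmapped records by recursing: one stack frame per skipped record."""
    rec = next(records)
    if rec["unmapped"]:
        return next_mapped_recursive(records)
    return rec

def next_mapped_iterative(records):
    """Skip unmapped records with a loop: constant stack depth."""
    for rec in records:
        if not rec["unmapped"]:
            return rec
    return None  # input exhausted without finding a mapped record

def make_records(n_unmapped):
    """Hypothetical input: a run of unmapped records followed by one mapped record."""
    return iter([{"unmapped": True}] * n_unmapped + [{"unmapped": False, "name": "r1"}])

# A run of unmapped records longer than the interpreter's recursion limit
depth = sys.getrecursionlimit() + 100
```

With this setup, `next_mapped_recursive(make_records(depth))` raises a RecursionError, whereas `next_mapped_iterative(make_records(depth))` returns the mapped record.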

AlexTate added 13 commits May 28, 2024 20:46
…etailed_mismatch is True.

Other major changes to AlignmentIter:
- Soft clipping is now honored in the Seq, Start, and End alignment dictionary keys.
- The constructor arg has_nm has been changed to expected_tags, which is just a tuple of the tags found in the first alignment. At construction this is checked for the NM tag, and if the new constructor arg detailed_mismatch is True, the MD tag is also checked. It's an error if these tags are missing but expected.
- CIGAR parsing for edit distance is no longer supported until the method can be rewritten. The current implementation does not properly disambiguate M operations. For this reason the NM tag is now (temporarily) required.
- Unmapped alignments are no longer skipped by recursively calling _next_alignment(). Python's max recursion depth is actually quite shallow, so a SAM/BAM file with 1000+ unmapped alignments would produce a RecursionError.
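
A minimal sketch of the tag check described for the constructor, assuming `expected_tags` is a tuple of two-letter tag names taken from the first alignment. The helper name `check_expected_tags` and the exception type are illustrative only; just `expected_tags` and `detailed_mismatch` come from this commit.

```python
def check_expected_tags(expected_tags, detailed_mismatch=False):
    """Raise if a required tag is absent from the first alignment's tags.

    NM is always required; MD is required only when a mismatch pattern
    (detailed_mismatch) has been requested.
    """
    required = ["NM"]
    if detailed_mismatch:
        required.append("MD")

    missing = [tag for tag in required if tag not in expected_tags]
    if missing:
        raise ValueError("Alignments are missing required tag(s): " + ", ".join(missing))
```

For example, `check_expected_tags(("NM",), detailed_mismatch=True)` raises because MD is absent, while the same tuple passes when `detailed_mismatch` is False.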
… command line argument, and the new arguments for AlignmentIter.

If the NM tag is missing from the first alignment, this is now treated as an error (see previous commit). If the MD tag is missing and a mismatch pattern was specified, this is treated as an error.
- EditMatch: now used for the Mismatches selector rather than the generic NumericalMatch. This was necessary because we now have to pass the entire alignment dictionary into Mismatch selectors rather than just the NM value.
- AdarEditMatch: evaluates the A->G mismatch pattern
- TutEditMatch: evaluates the N->U 3' terminal uridylation mismatch pattern.
…). Stage 2 alignment now passes the entire alignment dictionary to *EditMatch classes so that edit pattern can be evaluated using the SEQ field and MD tag.
…or more than just length selection for a while now...

Whitespace is now removed from the selector definition prior to validating and parsing it. This relaxes some formatting requirements for the definition, e.g., a range with spaces around the hyphen is now acceptable.
…ent now defaults to False. Since we are temporarily removing mismatch calculation from the CIGAR string, missing NM tags are now treated as an error. Previous commits on this branch only checked the first alignment for the NM tag.
…are required for the tiny-count help string.
…delimiter-surrounding whitespace from definitions before parsing. This allows more flexibility for definitions like "1 - 2" or "3 " which would have produced a potentially confusing error.

Cleaning up the regex that validates NumericalMatch definitions (mainly removing the ambiguous back reference). If validation fails, the original definition is shown to the user rather than the whitespace-cleaned definition.
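
The whitespace handling and validation described in these commits could look roughly like the sketch below. The grammar (comma-separated integers and hyphenated ranges), the regexes, and the function name `parse_numerical` are assumptions made for illustration and are not copied from NumericalMatch.

```python
import re

# Whitespace on either side of a comma or hyphen delimiter
DELIM_WS = re.compile(r"\s*([,-])\s*")
# Assumed grammar: integers or integer ranges separated by commas
VALID = re.compile(r"\d+(-\d+)?(,\d+(-\d+)?)*")

def parse_numerical(definition):
    """Return the set of integers described by a definition such as '20 - 22, 26'."""
    cleaned = DELIM_WS.sub(r"\1", definition.strip())
    if not VALID.fullmatch(cleaned):
        # Report the definition as the user wrote it, not the cleaned form
        raise ValueError(f"Invalid definition: {definition!r}")

    values = set()
    for part in cleaned.split(","):
        low, _, high = part.partition("-")
        values.update(range(int(low), int(high or low) + 1))
    return values
```

Here `parse_numerical(" 1 - 3 , 7 ")` yields `{1, 2, 3, 7}`, while a definition made only of whitespace, or one with whitespace inside a number list such as `"1 2"`, fails validation.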
…am since this otherwise isn't a target for tests.

Adding more test cases to the NumericalMatch validation test.

Finally, updating an unrelated decollapse test that had slipped through the cracks.
… and the temporary removal of mismatch calculation from the cigar string
@AlexTate AlexTate requested a review from taimontgomery May 31, 2024 18:22
AlexTate added 5 commits June 10, 2024 12:19
…, skip, and pad). If found, the exception includes the offending record number and the basename of the file.

In this commit I'm also moving static methods out of AlignmentIter and into standalone cdef functions. For cleanup and better testability.
…tion type as AlignmentIter when checking the first alignment for missing NM tag
…nt in MD strings. If it isn't preceded by another digit, skip the unnecessary work of parsing it into an int and incrementing the read position by 0.
…alignment tables when a mismatch pattern is specified.
@AlexTate
Member Author

AlexTate commented Jun 10, 2024

@taimontgomery
Additional changes:

  • Alignments are rejected if they report incompatible CIGAR operators (soft clip, hard clip, skip, and pad). If found, the exception includes the offending record number and the file's basename
  • Diagnostic alignment tables now include the MD string in the Mismatches column when a mismatch pattern is specified
  • Minor optimization in AdarEditMatch to skip parsing the "0" character when it is used as a delimiter/flank of mismatch operations (i.e., when it doesn't represent a run of matches)
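
To illustrate the first item, here is a hedged Python sketch of rejecting alignments by CIGAR operator. It works on the CIGAR string directly rather than through Pysam, and the function name, exception type, and message wording are hypothetical.

```python
import os
import re

# One CIGAR operation: a length followed by an operator character
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operators treated as incompatible, per the list above
INCOMPATIBLE = {"S": "soft clip", "H": "hard clip", "N": "skip", "P": "pad"}

def check_cigar(cigar, record_number, path):
    """Raise if the CIGAR string contains an incompatible operator."""
    for _, op in CIGAR_OP.findall(cigar):
        if op in INCOMPATIBLE:
            raise ValueError(
                f"Unsupported CIGAR operator {op} ({INCOMPATIBLE[op]}) "
                f"in record {record_number} of {os.path.basename(path)}"
            )
```

For instance, `check_cigar("2S20M", 7, "/data/sample.sam")` raises an error that names record 7 and `sample.sam` without exposing the full path, while `check_cigar("22M", 1, "/data/sample.sam")` passes silently.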

AlexTate added 3 commits July 8, 2024 13:34
…ery sequence handling to the former approach. The query sequence, its start position, and its length are once again determined from the entire read sequence rather than from the aligned portion in the case of base clipping.

Functionally, this commit has no effect; alignments that report clipped bases are rejected before reaching this point. However, these calls to Pysam's API carry significantly less overhead.