tiny-count: new edit pattern modes for the Mismatches selector #337
Open: AlexTate wants to merge 21 commits into master from issue-336
…etailed_mismatch is True. Other major changes to AlignmentIter:
- Soft clipping is now honored in the Seq, Start, and End alignment dictionary keys.
- The constructor arg has_nm has been changed to expected_tags, which is just a tuple of the tags found in the first alignment. At construction this is checked for the NM tag, and if the new constructor arg detailed_mismatch is True, the MD tag is also checked. It's an error if these tags are missing but expected.
- CIGAR parsing for edit distance is no longer supported until the method can be rewritten. The current implementation does not properly disambiguate M operations. For this reason the NM tag is now (temporarily) required.
- Unmapped alignments are no longer skipped by recursively calling _next_alignment(). Python's max recursion depth is actually quite shallow, so a SAM/BAM file with 1000+ unmapped alignments would produce a RecursionError.
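For reviewers who want to see why the recursive skip was a problem, here is a minimal sketch of the loop-based alternative. It is not the AlignmentIter code from this branch: `next_mapped` and the dictionary records with an `"unmapped"` key are hypothetical stand-ins for illustration.

```python
import sys


def next_mapped(records):
    """Return the next mapped record, or None if the iterator is exhausted.

    `records` is any iterator of dicts with a boolean "unmapped" key
    (a hypothetical stand-in for real alignment objects). Unmapped records
    are skipped in a loop, so the call stack never grows.
    """
    for record in records:
        if record["unmapped"]:
            continue
        return record
    return None


# A run of unmapped records longer than the interpreter's recursion limit.
# A skip implemented as "call myself again" would need one stack frame per
# unmapped record and fail here; the loop handles it without issue.
n_unmapped = sys.getrecursionlimit() + 500
records = iter(
    [{"name": f"unmapped_{i}", "unmapped": True} for i in range(n_unmapped)]
    + [{"name": "hit", "unmapped": False}]
)
first = next_mapped(records)
```

The default recursion limit in CPython is 1000, which matches the "1000+ unmapped alignments" threshold described above.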
… command line argument, and the new arguments for AlignmentIter. If the NM tag is missing from the first alignment, this is now treated as an error (see previous commit). If the MD tag is missing and a mismatch pattern was specified, this is treated as an error.
- EditMatch: now used for the Mismatches selector rather than the generic NumericalMatch. This was necessary because we now have to pass the entire alignment dictionary into Mismatch selectors rather than just the NM value.
- AdarEditMatch: evaluates the A->G mismatch pattern.
- TutEditMatch: evaluates the N->U 3' terminal uridylation mismatch pattern.
…). Stage 2 alignment now passes the entire alignment dictionary to *EditMatch classes so that edit pattern can be evaluated using the SEQ field and MD tag.
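As a rough illustration of how an edit pattern can be evaluated from the SEQ field and MD tag, the sketch below walks an MD string and checks whether every mismatch is A->G. It is not the implementation in this PR. The function names are hypothetical, and it assumes a forward-strand alignment with no insertions or clipped bases, so that read offsets follow directly from the MD string. Treating "no mismatches" as a non-match is also an assumption.

```python
import re

# MD tokens: a run of matching bases, a deletion (^ followed by reference
# bases), or a single mismatched reference base.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def md_mismatches(seq, md):
    """Return [(read_pos, ref_base, read_base), ...] for each MD mismatch.

    Assumes no insertions or clipping, so read offsets can be derived
    from the MD string alone.
    """
    pos = 0
    found = []
    for run, deletion, ref_base in MD_TOKEN.findall(md):
        if run:
            pos += int(run)  # matching bases advance the read position
        elif deletion:
            continue  # deleted reference bases consume no read positions
        else:
            found.append((pos, ref_base.upper(), seq[pos].upper()))
            pos += 1
    return found


def all_mismatches_a_to_g(seq, md):
    """True if there is at least one mismatch and every one is A->G."""
    mismatches = md_mismatches(seq, md)
    return bool(mismatches) and all(
        ref == "A" and read == "G" for _, ref, read in mismatches
    )
```

For example, `all_mismatches_a_to_g("ACGGTT", "3A2")` is true because the only mismatch is a reference A read as G at offset 3. Reverse-strand alignments would need the complementary pattern, which this sketch does not handle.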
… for specifying edit pattern.
…or more than just length selection for a while now... Whitespace is now removed from the selector definition prior to validating and parsing it. This relaxes some formatting requirements for the definition, e.g., a range with spaces around the hyphen is now acceptable.
…ent now defaults to False. Since we are temporarily removing mismatch calculation from the CIGAR string, missing NM tags are now treated as an error. Previous commits on this branch only checked the first alignment for the NM tag.
…to the lookup method
…are required for the tiny-count help string.
…delimiter-surrounding whitespace from definitions before parsing. This allows more flexibility for definitions like "1 - 2" or "3 " which would have produced a potentially confusing error. Cleaning up the regex that validates NumericalMatch definitions (mainly removing the ambiguous back reference). If validation fails, the original definition is shown to the user rather than the whitespace-cleaned definition.
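The whitespace handling described here can be sketched as follows. This is an illustration only: the grammar (comma-separated integers and `low-high` ranges), the regex, and the name `parse_numeric_definition` are assumptions, not the NumericalMatch code from this branch.

```python
import re

# Assumed grammar: comma-separated integers or "low-high" ranges.
VALID_DEFINITION = re.compile(r"\d+(-\d+)?(,\d+(-\d+)?)*")


def parse_numeric_definition(definition):
    """Parse a definition such as "20, 22 - 24" into a set of integers."""
    # Drop leading/trailing whitespace and whitespace around delimiters only,
    # so "1 2" is still rejected rather than silently becoming "12".
    cleaned = re.sub(r"\s*([,-])\s*", r"\1", definition.strip())
    if not VALID_DEFINITION.fullmatch(cleaned):
        # Show the user what they typed, not the cleaned form.
        raise ValueError(f"Invalid definition: {definition!r}")

    values = set()
    for part in cleaned.split(","):
        low, _, high = part.partition("-")
        values.update(range(int(low), int(high or low) + 1))
    return values
```

With this sketch, `"1 - 2"` and `"3 "` both parse, while a definition made up entirely of whitespace still fails validation and the error message quotes the original input.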
…am since this otherwise isn't a target for tests. Adding more test cases to the NumericalMatch validation test. Finally, updating an unrelated decollapse test that had slipped through the cracks.
… and the temporary removal of mismatch calculation from the cigar string
…, skip, and pad). If found, the exception includes the offending record number and the basename of the file. In this commit I'm also moving static methods out of AlignmentIter and into standalone cdef functions. For cleanup and better testability.
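A minimal sketch of this kind of check, for illustration. The function name, signature, and error wording are hypothetical, and the default set of rejected operations covers only skip (`N`) and pad (`P`), since the rest of the list is truncated in the commit message above.

```python
import os
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def check_cigar(cigar, record_number, path, unsupported="NP"):
    """Raise ValueError if the CIGAR string contains an unsupported operation.

    `unsupported` defaults to skip (N) and pad (P) only; the full list used
    by the real check is an unknown here.
    """
    for _length, op in CIGAR_OP.findall(cigar):
        if op in unsupported:
            raise ValueError(
                f"Unsupported CIGAR operation {op!r} in record "
                f"{record_number} of {os.path.basename(path)}"
            )
```

Reporting the basename rather than the full path keeps the message short while still identifying the offending file.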
…tion type as AlignmentIter when checking the first alignment for missing NM tag
…nt in MD strings. If it isn't preceded by another digit, skip the unnecessary work of parsing it into an int and incrementing the read position by 0.
…alignment tables when a mismatch pattern is specified.
@taimontgomery
…ery sequence handling to the former approach. The query sequence, its start position, and its length are once again determined from the entire read sequence rather than from the aligned portion in the case of base clipping. Functionally, this commit has no effect; alignments that report clipped bases are rejected before reaching this point. However, these calls to Pysam's API carry significantly less overhead.
…ing a CIGAR string
Rules that define a mismatch requirement can be extended to require a specific edit pattern. Two choices are available for this parameter:
This option applies globally to all rules except those that lack a mismatch requirement. Rules without the requirement will continue to allow any number of mismatches following any edit pattern.
This PR contains the following additional changes:
- Soft clipped bases are now honored and reflected in the sequence (and therefore 5' NT), start, and length. (Edit 6/10: after further discussion, there is more work to be done in reconciling inconsistencies across selectors when using this approach. This might be revisited in a future issue.)
- Length and Mismatches selectors are more forgiving when parsing and validating definitions. Leading and trailing whitespace is now permissible, as well as variable whitespace around commas and hyphens. Definitions composed entirely of whitespace will continue to fail validation.