Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: infer library type relationship #157

Merged
merged 15 commits into from
Jan 21, 2024
Merged

fix: infer library type relationship #157

merged 15 commits into from
Jan 21, 2024

Conversation

balajtimate
Copy link
Collaborator

@balajtimate balajtimate commented Jan 9, 2024

Description

  • Add check to assign first_mate and second_mate when it can correctly be inferred from seq_id's (when seq_ids are not empty)
  • Add check to only align when library source is either inferred or given by --tax-id argument (in case of library type as well as read orientation inference)
  • Add --outSAMorder PairedKeepInputOrder to mapping.py to keep input order of reads when running STAR with multiple threads
  • Update tests

In the case of paired-end libraries, the logic should be:

  1. For the library type inference, check seq_ids of first and second pair, to decide if they fit the Casava format, and the lib type and relationship can be determined
  2. If not, as in most SRA samples, align them separately to reference transcripts, but only if library_source is known
  3. Compare alignments and decide relationship
  4. For the orientation inference, based on the library type relationship determined:
    1. If not_mates or not_available, infer the orientation from the separately aligned results
    2. If split_mates, align the samples in paired-end mode, but only if library_source is known

Fixes #153

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

Please carefully read these items and tick them off if the statements are true
or do not apply.

  • I have performed a self-review of my own code
  • My code follows the existing coding style, lints and generates no new
    warnings
  • I have added type annotations to all function/method signatures, and I
    have added type annotations for any local variables that are non-trivial,
    potentially ambiguous or might otherwise benefit from explicit typing.
  • I have commented my code in hard-to-understand areas
  • I have added ["Google-style docstrings"] to all new modules, classes,
    methods/functions or updated previously existing ones
  • I have added tests that prove my fix is effective or that my feature
    works
  • New and existing unit tests pass locally with my changes and I have not
    reduced the code coverage relative to the previous state
  • I have updated any sections of the app's documentation that are affected
    by the proposed changes

If for some reason you are unable to tick off all boxes, please leave a
comment explaining the issue you are facing so that we can work on it
together.

Copy link

codecov bot commented Jan 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (930b741) 100.00% compared to head (8fe0cbe) 100.00%.

Additional details and impacted files
@@            Coverage Diff            @@
##               dev      #157   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           13        13           
  Lines         1109      1131   +22     
=========================================
+ Hits          1109      1131   +22     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@balajtimate balajtimate requested a review from uniqueg January 9, 2024 10:04
Copy link
Member

@uniqueg uniqueg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me (minus the very few minor comments), but I have to be honest in that it is a bit hard to follow the code in the library type model since the introduction of the mappings. I am pretty confident given the extensive tests, but I propose that you play this through with some small real world examples as well to check if this really behaves as desired from end to end.

htsinfer/get_library_type.py Outdated Show resolved Hide resolved
htsinfer/get_library_type.py Outdated Show resolved Hide resolved
htsinfer/get_library_type.py Outdated Show resolved Hide resolved
htsinfer/get_library_type.py Outdated Show resolved Hide resolved
htsinfer/get_library_type.py Outdated Show resolved Hide resolved
htsinfer/get_read_orientation.py Outdated Show resolved Hide resolved
@balajtimate
Copy link
Collaborator Author

To summarize: refactored _align_mates in get_library_type.py to correctly determine the number of aligned reads as well as the number of concordant reads.

When iterating through the reads, the problem during _compare_alignments was that the order of the reference names were scrambled. Turns out, this was the result of STAR: it outputs the aligned reads in the same order as the input read, except when running it on multiple threads, which I have been doing since day 1. When I reran the samples on the default 1 thread, the output of the alignment matched between the two samples, and the concordant reads were calculated correctly. So one solution was to sort reads by ref name using pysam, which outputs an extra sorted BAM file, but ultimately updating the STAR command with --outSAMorder PairedKeepInputOrder seems to solve this, so there is no extra sorting needed, the output will be in the same order as input (it's also possible to sort the aligned reads with STAR, but only according to coordinates, and I think it'd be better to have sorted according to ref names. Btw, is there any advantage in this case for using SAMs instead of BAMs? I think mapping.py could be updated to output BAMs, maybe in a separate PR).

For checking the ratio between the concordant and the aligned reads (for _update_relationship), I choose the lowest of the two alignments. Also, I'll run it on more samples and check the ratio, but I think we could lower it a bit, like 90% (instead of 95% default now) for the ratio / aligned reads to be considered paired end library.

uniqueg
uniqueg previously approved these changes Jan 18, 2024
Copy link
Member

@uniqueg uniqueg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, but please take care of the two comments first 🙏

htsinfer/mapping.py Show resolved Hide resolved
htsinfer/get_library_type.py Show resolved Hide resolved
@uniqueg uniqueg changed the title feat: infer library type relationship from aligned reads only fix: infer library type relationship Jan 18, 2024
@uniqueg
Copy link
Member

uniqueg commented Jan 18, 2024

Oh, and please update the PR description to reflect all of the latest changes. As far as I understood, you also fixed some things that never really worked.

@balajtimate balajtimate merged commit 7b65c43 into dev Jan 21, 2024
19 checks passed
@balajtimate balajtimate deleted the fix_sra_mates branch January 21, 2024 20:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: only map when lib source is given, argument to force mapping otherwise
2 participants