Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] pe correction, low_complexity_filter & abund_cutoff as config params #107

Closed
wants to merge 13 commits into from
Closed

Conversation

mr-eyes
Copy link
Member

@mr-eyes mr-eyes commented Jul 22, 2021

These are just suggestions to speed up the trimming step as it seems to be very slow. I don't know the implications though...

cc @bluegenes

These are just suggestions to speed up the trimming step as it seems to be very slow.
genome_grist/conf/Snakefile Outdated Show resolved Hide resolved
@ctb
Copy link
Member

ctb commented Jul 23, 2021

I'm not capable of reviewing the fastp change without a lot of effort, so I'll have to leave the actual review up to @bluegenes.

@mr-eyes have you observed any performance improvement?

@mr-eyes
Copy link
Member Author

mr-eyes commented Jul 23, 2021

Yes, the fastp step is now just slow after it was tooooo slow.

@bluegenes
Copy link
Member

bluegenes commented Jul 23, 2021

@mr-eyes do you think you could time a fastp run on one of your samples with both the previous version and your updated version so we have some numbers? If would be really awesome to have the runtime and json info of read trimming as a comment here for future info!

Copy link
Member

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes make sense for making things faster.

Would be really great to have at least one timed comparison, with corresponding info on the difference in # trimmed reads!

otherwise lgtm!

@mr-eyes
Copy link
Member Author

mr-eyes commented Jul 23, 2021

I will work on it soon and report. Thanks @bluegenes

@bluegenes
Copy link
Member

bluegenes commented Jan 18, 2022

@mr-eyes is this ready to get merged in? Or did you make any additional modifications?

Would prefer not to slow folks down if they come try genome-grist after @ctb's blog post :)

@mr-eyes mr-eyes changed the title [MRG] speed up adapter trimming [WIP] speed up adapter trimming Jan 19, 2022
@mr-eyes mr-eyes changed the title [WIP] speed up adapter trimming [WIP] pe correction, low_complexity_filter & abund_cutoff as config params Jan 19, 2022
@mr-eyes
Copy link
Member Author

mr-eyes commented Jan 20, 2022

Thanks, @bluegenes, for the ping.

I updated the PR description with more flexible changes. It's ready for review now, and I will post here later benchmark details.

@mr-eyes mr-eyes changed the title [WIP] pe correction, low_complexity_filter & abund_cutoff as config params [MRG] pe correction, low_complexity_filter & abund_cutoff as config params Jan 20, 2022
doc/configuring.md Outdated Show resolved Hide resolved

## Error trimming flags
# fastp_correction: set to ON or 1 for base correction for PE data
fastp_correction: OFF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use YAML booleans - it looks like OFF is a valid one (per https://yaml.org/type/bool.html), but your code below doesn't use boolean value checking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that checking for boolean?

correction_flag = "--correction" if config_correction in [True, '1'] else ''

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the 'if' here worries me - if it's a boolean, then you should just be able to say if config_correction. Most specifically, '1' should not be magic (which it seems to be here). I don't really know how snakemake's config yaml parser interprets valid vs invalid YAML boolean values, but I would be disturbed to see 'yes' interpreted as false here.

YAML doesn't seem to have a simple answer for what is a valid bool - this is a mess! -

 y|Y|yes|Yes|YES|n|N|no|No|NO
|true|True|TRUE|false|False|FALSE
|on|On|ON|off|Off|OFF

I guess the simplest thing to do would be to insist that it be a valid bool (True or False), not a '1' or anything else, and complain if it's not.

(I'm disgruntled with YAML because of this kind of thing - Luiz mentioned an increasing preference for TOML, because YAML isn't really that well specified.)

doc/configuring.md Outdated Show resolved Hide resolved
@ctb
Copy link
Member

ctb commented Jan 20, 2022

hi @mr-eyes I left some comments, looking good overall!

main requests other than that -

  • please update the PR description (top comment) to include a summary of the changes
  • I would very much like to see some minimal speed benchmarking on at least one data set before merging. Doesn't need to be big - something like SRR5650070 (p880mo11) might be good.

mr-eyes and others added 2 commits January 22, 2022 10:43
Co-authored-by: C. Titus Brown <[email protected]>
Co-authored-by: C. Titus Brown <[email protected]>
@mr-eyes
Copy link
Member Author

mr-eyes commented Jan 22, 2022

I did some benchmarks and found no significant change in time/memory. So while it is good to have a flexible configuration, this does not speed up genome-grist the way I expected. However, I think it can affect the performance depending on the size/quality of the dataset.

Wanted to ask, was there a specific criterion for selecting the abundtrim and trimming parameters? I can't imagine how it will biologically affect the results.

@ctb
Copy link
Member

ctb commented Jan 22, 2022

Wanted to ask, was there a specific criterion for selecting the abundtrim and trimming parameters? I can't imagine how it will biologically affect the results.

did you take a look at https://peerj.com/preprints/890/?

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants