-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Seqkit #59
base: dev
Are you sure you want to change the base?
Seqkit #59
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution!
Ultimately, the pipeline will allow to flexibly choose the tools that are run, so having more tools in the pipeline is always great, but what exactly was your rationale here?
Admittedly, I just skimmed over the description, but to me, it seems that it is predominantly meant to run on FASTA files and for judging the quality of genome assemblies. Unless I missed some relevant arguments, its application on sequencing reads is in my opinion does not really yield too many meaningful statistics - at least for Illumina, for Nanopore probably yes.
Do you happen to have a Nanopore example? For Illumina, this is basically all you get:
file format type num_seqs sum_len min_len avg_len max_len
S11_L001_R1_001.fastq.gz FASTQ DNA 35875355 5417178605 151 151.0 151
S11_L001_R1_001.fastq.gz FASTQ DNA 35875355 322878195 9 9.0 9
S11_L001_R2_001.fastq.gz FASTQ DNA 35875355 5417178605 151 151.0 151
But with regard to the code, this already looks very good and tidy!
I am only missing a publishDir
directive in the config, so that the reports from the stats output channel are published in a subfolder of the outdir
.
@@ -22,6 +22,10 @@ process { | |||
ext.args = '--quiet' | |||
} | |||
|
|||
withName: SEQKIT_STATS { | |||
ext.args = '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you not want to publish the results as well in the outdir
? Or only MultiQC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah it is defined here - I guess you would only overwrite it if you dont want your output published in the standard way.
seqinspector/conf/modules.config
Line 15 in 9cd232c
publishDir = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh fancy - a central publishDir
directive for all modules ?!? That's smart and has totally escaped me...sorry then for the false accusations!
// | ||
// MODULE: Run SEQKIT_STATS | ||
// | ||
SEQKIT_STATS ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, you are neither mixing the output into the MultiQC channel nor publishing it in the outdir
. Unless I am missing something, the results of the run are therefore not used at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah am outputting them
docs/output.md
Outdated
|
||
</details> | ||
|
||
[SeqkitStats](https://bioinf.shenwei.me/seqkit/usage/#stats) it gives general quality metrics about your sequenced reads including average read lengths, GC(%) and n50's. For further reading and documentation see the [Seqkit help pages]([Seqkit help](https://bioinf.shenwei.me/seqkit/)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is true that SeqkitStats computes some quality metrics, but to my best knowledge it is more useful for FASTA files and genomic assemblies than sequencing reads?
For example, for an Illumina run, an average read length prior to trimming should be known, because it corresponds to the number of cycles. For Nanopore, admittedly, such a statistic is more useful, so you may want to structure the documentation accordingly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated text a little, should i specifically mention this is more useful for nanopore data? for ilumina etc you do still get n50's etc but agree it is less useful - want me to add a nanopore test? :D
file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 N50_num Q20(%) Q30(%) AvgQual GC(%)
sample1_R1.fastq.gz FASTQ DNA 1377513 411872902 35 299.0 301 300.0 301.0 301.0 0 301 1 99.10 97.54 29.31 38.53
sample1_R2.fastq.gz FASTQ DNA 1377513 411840994 35 299.0 301 300.0 301.0 301.0 0 301 1 97.11 93.54 25.78 38.54
Add seqkit stats nf-core module
PR checklist
nf-core lint
).nf-test test main.nf.test -profile test,docker
).nextflow run . -profile debug,test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).