update dev #46

Merged
merged 26 commits into from
Nov 30, 2022

Conversation

priyanka-surana
Contributor

@priyanka-surana priyanka-surana commented Nov 1, 2022

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
    • If you've added a new tool, have you followed the pipeline conventions in the contribution docs?
    • If necessary, also make a PR on the nf-core/genomenote branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@priyanka-surana priyanka-surana self-assigned this Nov 1, 2022
@priyanka-surana priyanka-surana marked this pull request as ready for review November 24, 2022 21:08
@priyanka-surana
Contributor Author

Both the unit tests and the full tests ran successfully locally. The final output table for a full run is shown below:

##Assembly_Information
Accession,GCA_934047225.1
Organism_Name,Ypsolopha sequella
ToL_ID,ilYpsSequ2
Taxon_ID,1870436
Assembly_Name,ilYpsSequ2.1
Assembly_Level,Chromosome
Life_Stage,adult
Tissue,WHOLE ORGANISM
Sex,NOT COLLECTED
##Assembly_Statistics
Total_Sequence,866863759
Chromosomes,30
Scaffolds,150
Scaffold_N50,28919749
Contigs,228
Contig_N50,20841000
GC_Percent,36
##Chromosome,Length,GC_Percent
1,34252299,35.5
...
29,16431280,36
Z,51978369,35.5
##Organelle,Length,GC_Percent
Mitochondrion,15291,19.5
##BUSCO
Lineage,lepidoptera_odb10
Summary,"C:98.0%[S:96.8%,D:1.2%],F:0.5%,M:1.5%,n:5286"
##MerquryFK,ilYpsSequ2
QV,65.2
Completeness,100.00
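
For readers who want to consume this table programmatically, here is a minimal sketch of a parser. It assumes (this is an inference from the example above, not a documented format) that every line starting with `##` opens a new section and that any further fields on that line are column names or an identifier. The function name `parse_summary_table` is hypothetical and is not part of the pipeline.

```python
import csv
import io


def parse_summary_table(text):
    """Split the sectioned CSV into {section_name: {"header": [...], "rows": [...]}}.

    Assumption: a line whose first field starts with '##' opens a section;
    the remaining fields on that line are kept as the section header.
    """
    sections = {}
    current = None
    for row in csv.reader(io.StringIO(text, newline="")):
        if not row:
            continue  # skip blank lines
        if row[0].startswith("##"):
            current = row[0][2:]
            sections[current] = {"header": row[1:], "rows": []}
        elif current is not None:
            sections[current]["rows"].append(row)
    return sections


# Excerpt of the table shown above
example = """##Assembly_Statistics
Total_Sequence,866863759
Chromosomes,30
##BUSCO
Lineage,lepidoptera_odb10
Summary,"C:98.0%[S:96.8%,D:1.2%],F:0.5%,M:1.5%,n:5286"
##MerquryFK,ilYpsSequ2
QV,65.2
"""

table = parse_summary_table(example)
print(table["BUSCO"]["rows"][1][1])
```

Using the `csv` module rather than splitting on commas matters here because the quoted BUSCO summary string itself contains commas.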

@priyanka-surana
Contributor Author

@muffato Once we figure out the correct sort command, I will make the changes.

@priyanka-surana
Contributor Author

The BED sort is fixed, and unit testing completed successfully with only -k4.

@muffato muffato requested a review from BethYates November 25, 2022 10:37
@muffato
Member

muffato commented Nov 25, 2022

(adding @BethYates so that she can comment on the output of the pipeline and how data are retrieved / computed)

@BethYates
Collaborator

I think this has most of the information I expect to be able to get from the genome afterparty pipeline. There is also some information I wasn't expecting to be able to source from here but can certainly use for the genome note. The only thing I'm not seeing (other than BlobToolKit data, obviously) is the transcript mappability value. Is this something this pipeline will produce?

@priyanka-surana
Contributor Author

@muffato and I discussed mappability: it is not going to be a number but rather a track on the contact maps. Yes, that will be done for the genome using a different subworkflow (#23). That would be genome mappability, though, not transcript mappability.

As I understand it, transcript mappability requires RNA-seq data, and we don't have enough of that yet to implement it in our pipeline. Eventually, the same or a similar workflow can be used to generate it.

Does that make sense?

This was linked to issues Nov 28, 2022
bin/bed_filter.sh (resolved)
subworkflows/local/genome_statistics.nf (resolved)
bin/summary_table.py (outdated, resolved)
bin/summary_table.py (outdated, resolved)
Comment on lines 59 to 60
for chrom in [[mol["assigned_molecule_location_type"], mol["length"], mol["gc_percent"]] for mol in seq if "gc_percent" in mol and mol["assembly_unit"] == "non-nuclear"]:
    writer.writerow(chrom)
Member

This could undergo a small rewrite like the other for+for+if line.
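
One possible shape for such a rewrite is sketched below: the list comprehension inside the `for` header is unrolled into a plain loop with an `if`. The rewrite actually adopted in the PR is not shown in this thread, so this is only an illustration. The wrapper function `write_organelles` and the sample record are hypothetical; the dictionary keys are the ones used in the commented lines.

```python
import csv
import io


def write_organelles(seq, writer):
    """Write one row per non-nuclear molecule that has a GC percentage."""
    for mol in seq:
        # Same filter as the original comprehension, evaluated in the same order
        if "gc_percent" in mol and mol["assembly_unit"] == "non-nuclear":
            writer.writerow(
                [mol["assigned_molecule_location_type"], mol["length"], mol["gc_percent"]]
            )


# Hypothetical records, loosely modelled on the output table earlier in the thread
seq = [
    {"assigned_molecule_location_type": "Chromosome", "length": 34252299,
     "gc_percent": 35.5, "assembly_unit": "Primary Assembly"},
    {"assigned_molecule_location_type": "Mitochondrion", "length": 15291,
     "gc_percent": 19.5, "assembly_unit": "non-nuclear"},
    # No gc_percent key, so this record is skipped before assembly_unit is read
    {"assigned_molecule_location_type": "Chromosome", "length": 1000},
]

buffer = io.StringIO()
write_organelles(seq, csv.writer(buffer))
print(buffer.getvalue(), end="")
```

The behaviour is unchanged; the loop simply avoids building an intermediate list and keeps the filter condition on its own line.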

bin/add_merqury.py (outdated, resolved)
subworkflows/local/genome_statistics.nf (outdated, resolved)
conf/test_full.config (resolved)
bin/summary_table.py (outdated, resolved)
@priyanka-surana priyanka-surana linked an issue Nov 28, 2022 that may be closed by this pull request
subworkflows/local/genome_statistics.nf (outdated, resolved)
bin/summary_table.py (outdated, resolved)
@priyanka-surana
Contributor Author

I think the most natural would be to have a single table-creation script that has some mandatory inputs (take genome_summary json and sequence_summary json) and some optional ones (merqury stats, and I would also make the busco json optional in this subworkflow). The summarytable.nf module would be updated accordingly.
It'd be like a funnel, summarising all the inputs that are present. This way, there would be a single output file for this subworkflow, no confusion about which of summary and table is the right output.
Also, retrospectively, it would guarantee that this file has the same format throughout (e.g. w.r.t. the line returns)

I moved back to a single table module that accepts optional inputs. Testing with all inputs and testing without the Merqury stats both work. I could not find an example to test without BUSCO.
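
The "funnel" design discussed in this comment can be sketched as a command-line interface with two mandatory inputs and two optional ones. This is a minimal sketch under stated assumptions, not the script merged in this PR: the flag names (`--genome`, `--sequence`, `--busco`, `--merqury`) and the helper `sections_to_write` are hypothetical, and the section names are taken from the output table shown earlier in the thread.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Combine all available summary inputs into one table (sketch)."
    )
    # Mandatory inputs (flag names are assumptions for illustration)
    parser.add_argument("--genome", required=True, help="genome_summary JSON")
    parser.add_argument("--sequence", required=True, help="sequence_summary JSON")
    # Optional inputs: their sections are emitted only when a file is given
    parser.add_argument("--busco", default=None, help="BUSCO JSON (optional)")
    parser.add_argument("--merqury", default=None, help="Merqury stats (optional)")
    return parser


def sections_to_write(args):
    """Return the table sections that the provided inputs can fill."""
    sections = ["Assembly_Information", "Assembly_Statistics"]
    if args.busco:
        sections.append("BUSCO")
    if args.merqury:
        sections.append("MerquryFK")
    return sections


# Example: a run without Merqury stats
args = build_parser().parse_args(
    ["--genome", "genome.json", "--sequence", "sequence.json", "--busco", "busco.json"]
)
print(sections_to_write(args))
```

Because every run goes through the same writer, the output file keeps one format regardless of which optional inputs are present.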

Member

@muffato muffato left a comment

The script is great, thanks!

modules/local/createtable.nf (outdated, resolved)
bin/create_table.py (outdated, resolved)
priyanka-surana and others added 2 commits November 30, 2022 13:28
Co-authored-by: Matthieu Muffato <[email protected]>
Co-authored-by: Matthieu Muffato <[email protected]>
Member

@muffato muffato left a comment

For the record, create_table.nf was renamed to createtable.nf to adhere to the naming conventions.

@priyanka-surana priyanka-surana merged commit eec15e8 into dev Nov 30, 2022
@priyanka-surana priyanka-surana deleted the update_dev branch November 30, 2022 16:53
3 participants