update dev #46

priyanka-surana · 2022-11-01T20:49:30Z

New modules structure
Some inputs/outputs were changed, resulting in downstream modifications
Added temporary gunzip functionality
Linked issues: Update bed_filter #47, update bedtools_bamtobed config settings #48, N50 from GoaT not always available #49, Update create_table to deal with optional inputs #50, Remove bedtools_sort #51
Added chromosome and organelle information

priyanka-surana · 2022-11-24T21:11:17Z

Both unit tests and full tests ran successfully locally. The final output table for a full run is shown below:

##Assembly_Information
Accession,GCA_934047225.1
Organism_Name,Ypsolopha sequella
ToL_ID,ilYpsSequ2
Taxon_ID,1870436
Assembly_Name,ilYpsSequ2.1
Assembly_Level,Chromosome
Life_Stage,adult
Tissue,WHOLE ORGANISM
Sex,NOT COLLECTED
##Assembly_Statistics
Total_Sequence,866863759
Chromosomes,30
Scaffolds,150
Scaffold_N50,28919749
Contigs,228
Contig_N50,20841000
GC_Percent,36
##Chromosome,Length,GC_Percent
1,34252299,35.5
...
29,16431280,36
Z,51978369,35.5
##Organelle,Length,GC_Percent
Mitochondrion,15291,19.5
##BUSCO
Lineage,lepidoptera_odb10
Summary,"C:98.0%[S:96.8%,D:1.2%],F:0.5%,M:1.5%,n:5286"
##MerquryFK,ilYpsSequ2
QV,65.2
Completeness,100.00
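
For anyone consuming this table downstream, here is a minimal sketch of how it could be read back in. It assumes that every line starting with `##` opens a new section and that any further fields on that line are column names or a label; the function name `read_sections` and the shortened sample are illustrative only and are not part of the pipeline.

```python
import csv
import io

# Shortened copy of the table above, used only as sample input.
TABLE = '''##Assembly_Information
Accession,GCA_934047225.1
Organism_Name,Ypsolopha sequella
##BUSCO
Lineage,lepidoptera_odb10
Summary,"C:98.0%[S:96.8%,D:1.2%],F:0.5%,M:1.5%,n:5286"
##MerquryFK,ilYpsSequ2
QV,65.2
Completeness,100.00
'''


def read_sections(handle):
    """Group CSV rows under the most recent '##' header line."""
    sections = {}
    current = None
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        if row[0].startswith("##"):
            current = row[0][2:]
            # Extra fields on the header line (column names, sample ID) are kept as-is.
            sections[current] = {"header": row[1:], "rows": []}
        elif current is not None:
            sections[current]["rows"].append(row)
    return sections


sections = read_sections(io.StringIO(TABLE))
print(sections["BUSCO"]["rows"][1][1])
```

Using `csv.reader` rather than splitting on commas matters here because the BUSCO summary is a quoted field that itself contains commas.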

priyanka-surana · 2022-11-25T09:03:25Z

@muffato Once we figure out the correct sort command, I will make the changes.

priyanka-surana · 2022-11-25T10:15:40Z

bed sort is fixed and unit testing was completed successfully with only -k4.

muffato · 2022-11-25T10:38:17Z

(adding @BethYates so that she can comment on the output of the pipeline and how data are retrieved / computed)

bin/bed_filter.sh

bin/tol_input.sh

BethYates · 2022-11-25T14:35:09Z

I think this has most of the information I expect to be able to get from the genome afterparty pipeline. There is also some information I wasn't expecting to be able to source from here but can certainly use for the genome note. The only thing I'm not seeing (other than BlobToolKit data, obviously) is the transcript mappability value. Is this something this pipeline will produce?

priyanka-surana · 2022-11-25T17:31:55Z

@muffato and I discussed mappability: it is not going to be a number but rather a track on the contact maps. Yes, that will be done for the genome using a different subworkflow (#23). That would be genome mappability, though, not transcript mappability.

From what I understand, transcript mappability requires RNA-seq data, and we don't have enough of it yet to implement this in our pipeline. Eventually the same or a similar workflow as above can be used to generate it.

Does that make sense?

bin/bed_filter.sh

subworkflows/local/genome_statistics.nf

bin/summary_table.py

muffato · 2022-11-25T22:52:21Z

bin/summary_table.py

+    for chrom in [[mol["assigned_molecule_location_type"], mol["length"], mol["gc_percent"]] for mol in seq if "gc_percent" in mol and mol["assembly_unit"] == "non-nuclear"]:
+        writer.writerow(chrom)


This could undergo a small rewrite like the other for+for+if line
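
One possible shape for that rewrite is sketched below: the nested comprehension becomes a plain loop with the filter as an `if`. The field names come from the diff above; the sample records in `seq` and the helper name `organelle_rows` are made up for illustration, and the actual change in the PR may differ.

```python
import csv
import sys

# Hypothetical records using the same keys as the diff above.
seq = [
    {"assigned_molecule_location_type": "Chromosome", "length": 34252299,
     "gc_percent": 35.5, "assembly_unit": "Primary Assembly"},
    {"assigned_molecule_location_type": "Mitochondrion", "length": 15291,
     "gc_percent": 19.5, "assembly_unit": "non-nuclear"},
    # No gc_percent key, so this record is skipped.
    {"assigned_molecule_location_type": "Mitochondrion", "length": 100,
     "assembly_unit": "non-nuclear"},
]


def organelle_rows(seq):
    """Return [type, length, gc_percent] for non-nuclear molecules that have a GC value."""
    rows = []
    for mol in seq:
        if "gc_percent" in mol and mol["assembly_unit"] == "non-nuclear":
            rows.append([
                mol["assigned_molecule_location_type"],
                mol["length"],
                mol["gc_percent"],
            ])
    return rows


writer = csv.writer(sys.stdout)
for row in organelle_rows(seq):
    writer.writerow(row)
```

The behaviour is the same as the one-line version; the loop simply puts the filter condition and the selected fields on separate lines so each can be read and changed on its own.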

bin/add_merqury.py

subworkflows/local/genome_statistics.nf

conf/test_full.config

bin/summary_table.py

Co-authored-by: Matthieu Muffato

subworkflows/local/genome_statistics.nf

bin/summary_table.py

priyanka-surana · 2022-11-29T12:56:53Z

I think the most natural would be to have a single table-creation script that has some mandatory inputs (take genome_summary json and sequence_summary json) and some optional ones (merqury stats, and I would also make the busco json optional in this subworkflow). The summarytable.nf module would be updated accordingly.
It'd be like a funnel, summarising all the inputs that are present. This way, there would be a single output file for this subworkflow, no confusion about which of summary and table is the right output.
Also, retrospectively, it would guarantee that this file has the same format throughout (e.g. w.r.t. the line returns)

Moved back to a single table module that accepts optional inputs. Testing works both with all inputs and without merqury. I cannot find an example to test without busco.
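
As a rough illustration of the funnel idea, a table script with mandatory and optional inputs could be laid out as below. This is only a sketch: the flag names, the helper names and the mapping from inputs to section headers are assumptions and are not taken from `create_table.py`.

```python
import argparse
import csv
import io


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Summarise whichever inputs are present into one table."
    )
    # Mandatory inputs (flag names are hypothetical).
    parser.add_argument("--genome", required=True, help="genome summary JSON")
    parser.add_argument("--sequence", required=True, help="sequence summary JSON")
    # Optional inputs: their sections are written only when a file is given.
    parser.add_argument("--busco", help="BUSCO JSON")
    parser.add_argument("--merqury", help="MerquryFK statistics")
    return parser.parse_args(argv)


def section_headers(args):
    """List the section header rows that would be written, in a fixed order."""
    # Assumed to come from the mandatory inputs.
    headers = [["##Assembly_Information"], ["##Assembly_Statistics"]]
    if args.busco:
        headers.append(["##BUSCO"])
    if args.merqury:
        headers.append(["##MerquryFK"])
    return headers


def write_rows(rows, handle):
    # A single writer with an explicit terminator keeps line returns uniform.
    csv.writer(handle, lineterminator="\n").writerows(rows)


# Example run with BUSCO supplied and MerquryFK left out.
args = parse_args(["--genome", "genome.json", "--sequence", "sequence.json",
                   "--busco", "busco.json"])
buffer = io.StringIO()
write_rows(section_headers(args), buffer)
print(buffer.getvalue(), end="")
```

Routing every section through one writer is what gives the single output file a consistent format; note that `csv.writer` defaults to `\r\n` line endings unless `lineterminator` is set.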

subworkflows/local/genome_statistics.nf

muffato

The script is great, thanks!

modules/local/createtable.nf

bin/create_table.py

Co-authored-by: Matthieu Muffato

muffato

For the record, create_table.nf was renamed to createtable.nf to adhere to the naming conventions.

update dev

0bd4469

priyanka-surana self-assigned this Nov 1, 2022

priyanka-surana added 2 commits November 1, 2022 21:24

update input tol module

2680191

update test config

4f2938f

This was linked to issues Nov 7, 2022

Update bed_filter #47

Closed

update bedtools_bamtobed config settings #48

Closed

N50 from GoaT not always available #49

Closed

Update create_table to deal with optional inputs #50

Closed

Remove bedtools_sort #51

Closed

priyanka-surana added 8 commits November 7, 2022 16:21

Update bed_filter #47

d07f240

Update bedtools_bamtobed config settings #48

253fa85

From goat to ncbi and update table #49 #50

86d1170

append merqury output to table #50

7336e82

add chrom and organelle info

ef0442b

Make common name optional #49 #50

169ad1a

Remove bedtools_sort #51

ef12627

update samtools modules

38b1db3

priyanka-surana marked this pull request as ready for review November 24, 2022 21:08

priyanka-surana requested a review from muffato November 25, 2022 09:02

Reverting to original bed sort #51

94e69c1

muffato requested a review from BethYates November 25, 2022 10:37

Changing subworkflows from serial to parallel

82e5607

priyanka-surana commented Nov 25, 2022

bin/bed_filter.sh

priyanka-surana commented Nov 25, 2022

bin/tol_input.sh

This was linked to issues Nov 28, 2022

Unnecessary sort #45

Closed

Subworkflow: table_statistics #27

Closed

muffato reviewed Nov 28, 2022

priyanka-surana and others added 7 commits November 28, 2022 10:18

remove extra modules and add full samplesheet

f369c52

break compound python statements

f1efeec

Co-authored-by: Matthieu Muffato

Reorder join statements

e8dcf60

Co-authored-by: Matthieu Muffato

modify join statements

abbb12c

break compound statements

307ee71

Co-authored-by: Matthieu Muffato

break compound statements

6e5aa0c

change file open

68806e3

Co-authored-by: Matthieu Muffato

priyanka-surana linked an issue Nov 28, 2022 that may be closed by this pull request

Organelles #25

Closed

priyanka-surana requested a review from muffato November 28, 2022 11:07

muffato reviewed Nov 28, 2022

subworkflows/local/genome_statistics.nf

bin/summary_table.py

priyanka-surana added 2 commits November 29, 2022 10:12

fixed typo

38fbb9e

create a single table

f4e195e

priyanka-surana commented Nov 29, 2022

subworkflows/local/genome_statistics.nf

priyanka-surana requested a review from muffato November 29, 2022 15:00

Fix typo

20dd312

muffato reviewed Nov 30, 2022

modules/local/createtable.nf

bin/create_table.py

priyanka-surana and others added 2 commits November 30, 2022 13:28

Update pacbio sample def

a02332f

Co-authored-by: Matthieu Muffato

version bump

78aa15b

Co-authored-by: Matthieu Muffato

muffato approved these changes Nov 30, 2022

required flags

cb5c9bb

priyanka-surana merged commit eec15e8 into dev Nov 30, 2022

priyanka-surana deleted the update_dev branch November 30, 2022 16:53
