Move accession to the first column of metadata_all.tsv #36

Merged 2 commits on Feb 24, 2024
9 changes: 8 additions & 1 deletion phylogenetic/rules/merge_sequences_usvi.smk
@@ -22,7 +22,13 @@ This part of the workflow usually includes the following steps:
"""

rule append_usvi:
- """Appending USVI sequences"""
+ """Appending USVI sequences
+
+ Notable columns:
+ - accession: Either the GenBank accession or USVI accession.
+ - genbank_accession: GenBank accession for Auspice to generate a URL to the NCBI GenBank record. Empty for USVI sequences.
+ - url: URL used in Auspice, to either link to the USVI github repo (https://github.com/blab/zika-usvi/) or link to the NCBI GenBank record ('https://www.ncbi.nlm.nih.gov/nuccore/*')
+ """
input:
sequences = "data/sequences.fasta",
metadata = "data/metadata.tsv",
@@ -43,5 +49,6 @@ rule append_usvi:
-n accession \
-e '$genbank_accession' \
| csvtk concat -tl - {input.usvi_metadata} \
+ | tsv-select -H -f accession --rest last \
Comment on lines 51 to +52
Member

(non-blocking)

Suggestion: In cases like this where a column name is ambiguous, add more detail somewhere in the repo. Maybe as the docstring of this rule:

rule append_usvi:
    """Appending USVI sequences.

    Notable columns:
    - accession: Either the GenBank accession or USVI accession.
    - genbank_accession: For Auspice to generate a URL to the NCBI GenBank record. Empty for USVI sequences.
    - url: ?
    """
    input:
        …

I don't know if this has been done in other repos, but it seems like it'd be useful to bring this context out of commit messages/PRs and into the code itself.

Contributor Author

@j23414, Feb 23, 2024

Thanks, I agree with writing the context into the code itself (as a docstring). Fixed in 3631e90, but let me know if the url explanation is confusing.

Member

(Discussion is beyond the scope of this PR now, so feel free to merge first.)

The Auspice source code indicates that when both are specified, genbank_accession takes precedence and url will be ignored.

Because of this behavior, I don't think url should be set for GenBank sequences. It would only cause confusion if GenBank changes the URL: we would update it in this file, then scratch our heads over why the old URL is still showing on Auspice. I would expect url to be set only for USVI sequences, where there is no genbank_accession.
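
As a rough illustration of that expectation (not part of this PR), the sketch below blanks url on every row that already has a genbank_accession. It assumes the merged metadata is a plain TSV with the genbank_accession and url columns described in the docstring; the function name clear_redundant_urls is hypothetical, and Python's default CSV quoting rules are assumed to be acceptable for the file.

```python
import csv
import io


def clear_redundant_urls(tsv_text: str) -> str:
    """Blank the url field on rows that already have a genbank_accession.

    Hypothetical helper: assumes a TSV with 'genbank_accession' and 'url'
    columns, as described in the append_usvi docstring.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        # Per the comment above, a url on these rows would be ignored anyway.
        if (row.get("genbank_accession") or "").strip():
            row["url"] = ""
        writer.writerow(row)
    return out.getvalue()
```

With this in place, only USVI rows (which have no genbank_accession) would keep a url value.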

> {output.metadata}
"""
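
For readers unfamiliar with tsv-select, the `tsv-select -H -f accession --rest last` step added in this diff outputs the accession column first and every other column after it, in their original order. A minimal Python sketch of the same reordering follows; the function name move_column_first is hypothetical and the sketch is only an illustration, not a replacement for the shell pipeline in the rule.

```python
import csv
import io


def move_column_first(tsv_text: str, column: str = "accession") -> str:
    """Return the TSV with `column` moved to the front, other columns unchanged."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return ""
    header = rows[0]
    idx = header.index(column)  # raises ValueError if the column is missing
    # Selected column first, then the rest in their original order.
    order = [idx] + [i for i in range(len(header)) if i != idx]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row[i] for i in order])
    return out.getvalue()
```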