Ingest with nextclade #62

Open
wants to merge 5 commits into
base: master

joverlee521
Contributor

Description of proposed changes

Runs Nextclade as part of the ingest workflow so that we get Nextclade clade annotations for all H5 HA sequences.
Uses the community/moncla-lab/iav-h5/ha/all-clades Nextclade dataset.
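For readers less familiar with the Nextclade CLI, the step described above could look roughly like the sketch below. It only builds and prints the two command lines (download the dataset, then run Nextclade on the HA sequences). The file paths are hypothetical placeholders, not the paths used by this workflow, and the actual workflow rules added in this PR are not reproduced here.

```python
import shlex

# Dataset name taken from the PR description.
DATASET_NAME = "community/moncla-lab/iav-h5/ha/all-clades"


def nextclade_commands(sequences, dataset_zip, output_tsv, dataset_name=DATASET_NAME):
    """Return argv lists for fetching a dataset and running Nextclade.

    All paths passed in are hypothetical placeholders for illustration.
    """
    get_dataset = [
        "nextclade", "dataset", "get",
        "--name", dataset_name,
        "--output-zip", dataset_zip,
    ]
    run = [
        "nextclade", "run",
        "--input-dataset", dataset_zip,
        "--output-tsv", output_tsv,
        sequences,
    ]
    return [get_dataset, run]


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch
    # works without Nextclade installed.
    for argv in nextclade_commands("ha_sequences.fasta", "dataset.zip", "nextclade.tsv"):
        print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` if Nextclade is available on the `PATH`.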

Related issue(s)

Resolves #44

Checklist

  • Checks pass
  • Trial fauna ingest -> results at s3://nextstrain-data-private/files/workflows/avian-flu/trial/ingest-with-nextclade/metadata.tsv.zst
  • Trial NCBI ingest -> results at s3://nextstrain-data/files/workflows/avian-flu/trial/ingest-with-nextclade/h5n1/metadata.tsv.zst

joverlee521 added a commit that referenced this pull request Jun 24, 2024
Motivated by my own need to test the ingest workflows for the latest
addition of Nextclade outputs in #62.

joverlee521 force-pushed the ingest-with-nextclade branch from 58df478 to f33157f on June 24, 2024 20:20

Using `community/moncla-lab/iav-h5/ha/all-clades` as the default
Nextclade dataset since it works across fauna and NCBI data.

Subsequent commits will join these rules with the full ingest
workflows.

Using the nextclade_field_map that's currently used in the measles
ingest workflow.¹ We can cut down on the columns used if they are not
useful for avian flu.

¹ <https://github.com/nextstrain/measles/blob/957fc744c64b8f5a722b5c525687d0746755add6/ingest/defaults/nextclade_field_map.tsv>

We are not using the alignment.fasta anywhere and I don't think
it makes sense to only upload alignment for the HA segment.

Keep a copy of the full Nextclade TSV output from ingest on S3
since we won't necessarily join all columns with the metadata output.

joverlee521 force-pushed the ingest-with-nextclade branch from f33157f to 8adda40 on June 24, 2024 20:22

Comment on lines +8 to +28
coverage coverage
totalMissing missing_data
totalSubstitutions divergence
totalNonACGTNs nonACGTN
qc.overallStatus QC_overall
qc.missingData.status QC_missing_data
qc.mixedSites.status QC_mixed_sites
qc.privateMutations.status QC_rare_mutations
qc.snpClusters.status QC_snp_clusters
qc.frameShifts.status QC_frame_shifts
qc.stopCodons.status QC_stop_codons
frameShifts frame_shifts
privateNucMutations.reversionSubstitutions private_reversion_substitutions
privateNucMutations.labeledSubstitutions private_labeled_substitutions
privateNucMutations.unlabeledSubstitutions private_unlabeled_substitutions
privateNucMutations.totalReversionSubstitutions private_total_reversion_substitutions
privateNucMutations.totalLabeledSubstitutions private_total_labeled_substitutions
privateNucMutations.totalUnlabeledSubstitutions private_total_unlabeled_substitutions
privateNucMutations.totalPrivateSubstitutions private_total_private_substitutions
qc.snpClusters.clusteredSNPs private_snp_clusters
qc.snpClusters.totalSNPs private_total_snp_clusters
Contributor Author

Should we keep only the clade assignment and drop all of these other columns? These QC outputs are specific to the HA segment, so it might not make sense to keep them as part of the overall metadata.
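To make the trade-off concrete, here is a minimal sketch of how a two-column field map like the one above selects and renames Nextclade output columns. It assumes the map file is tab-separated (as the `.tsv` name suggests) and that `#` lines are comments; the function names and the sample row are made up for illustration and are not the workflow's actual implementation.

```python
import csv
import io


def load_field_map(handle):
    """Read (Nextclade column, metadata column) pairs from a two-column TSV.

    Blank lines and lines starting with '#' are skipped.
    """
    field_map = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        source, target = line.split("\t")
        field_map[source] = target
    return field_map


def rename_columns(nextclade_tsv, field_map):
    """Yield rows that keep only the mapped columns, under their new names."""
    reader = csv.DictReader(nextclade_tsv, delimiter="\t")
    for row in reader:
        yield {new: row[old] for old, new in field_map.items() if old in row}


if __name__ == "__main__":
    # Made-up Nextclade output row; the values are illustrative only.
    nextclade_output = (
        "clade\tcoverage\ttotalMissing\tqc.overallStatus\n"
        "2.3.4.4b\t0.99\t12\tgood\n"
    )
    full_map = load_field_map(io.StringIO(
        "clade\tclade\n"
        "coverage\tcoverage\n"
        "totalMissing\tmissing_data\n"
        "qc.overallStatus\tQC_overall\n"
    ))
    clade_only_map = {"clade": "clade"}

    print(list(rename_columns(io.StringIO(nextclade_output), full_map)))
    print(list(rename_columns(io.StringIO(nextclade_output), clade_only_map)))
```

Trimming the map to the clade row is then a one-file change, and the unmapped columns simply never reach the metadata.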

joverlee521 requested a review from a team on June 24, 2024 20:25
joverlee521 requested a review from lmoncla on June 24, 2024 20:53

# Nextclade can have pathogen specific output columns so make sure to check which
# columns would be useful for your downstream phylogenetic analysis.
seqName seqName
clade clade
Contributor Author

Would it make sense to include the polybasic_cleavage_site output column or should the phylogenetic builds continue to rely on scripts/annotate-ha-cleavage-site.py?
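As a rough illustration of what joining the `seqName` and `clade` columns onto the metadata involves, the sketch below does a left join in plain Python. The metadata ID column name (`strain`) is an assumption, as are the function name and the sample records; the workflow itself may use a different tool and key for this join.

```python
def join_clades(metadata_rows, nextclade_rows,
                metadata_id="strain", nextclade_id="seqName"):
    """Left-join Nextclade's clade column onto metadata records.

    Records without a matching Nextclade result get an empty clade value.
    The default metadata ID column ("strain") is an assumption.
    """
    clade_by_id = {row[nextclade_id]: row["clade"] for row in nextclade_rows}
    for record in metadata_rows:
        yield {**record, "clade": clade_by_id.get(record[metadata_id], "")}


if __name__ == "__main__":
    # Made-up records for illustration only.
    metadata = [
        {"strain": "example_strain_1", "date": "2024-01-01"},
        {"strain": "example_strain_2", "date": "2024-02-01"},
    ]
    nextclade = [{"seqName": "example_strain_1", "clade": "2.3.4.4b"}]

    for record in join_clades(metadata, nextclade):
        print(record)
```

A left join keeps every metadata record, so sequences that Nextclade did not process still appear in the output with a blank clade.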

Comment on lines +68 to +72
{
  "key": "clade",
  "title": "Nextclade Clade",
  "type": "categorical"
},
Contributor Author

I've added the Nextclade Clade as a separate coloring so we can do comparisons across clade labels, but maybe we'll remove h5_label_clade eventually? Would love to hear your thoughts here @lmoncla 🙏
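One way to inform that decision outside of the tree view is to tabulate how the two labels line up. The sketch below counts each (`h5_label_clade`, `clade`) pair and lists the pairs that differ. It assumes both columns are available on the same records, which may not match how the builds actually store them, and the sample records are made up.

```python
from collections import Counter


def compare_clade_labels(records, old_key="h5_label_clade", new_key="clade"):
    """Count how often each (old label, new label) pair occurs."""
    return Counter((r.get(old_key, ""), r.get(new_key, "")) for r in records)


def disagreements(pair_counts):
    """Return only the pairs whose two labels differ."""
    return {pair: n for pair, n in pair_counts.items() if pair[0] != pair[1]}


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        {"strain": "s1", "h5_label_clade": "2.3.4.4b", "clade": "2.3.4.4b"},
        {"strain": "s2", "h5_label_clade": "2.3.4.4b", "clade": "2.3.4.4b"},
        {"strain": "s3", "h5_label_clade": "2.3.2.1c", "clade": "2.3.2.1a"},
    ]
    counts = compare_clade_labels(records)
    for (old, new), n in disagreements(counts).items():
        print(f"{old} -> {new}: {n}")
```

If the disagreement table stays empty or small across the trial ingest outputs, that would support dropping one of the two colorings.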

Development

Successfully merging this pull request may close these issues.

ingest: Run Nextclade as part of ingest
2 participants