H9Nx (all lineages), H9Nx (Y-lineage), H9Nx (G-lineage), H9Nx (B-lineage) #1552

jurresiegers · 2024-11-18T07:21:39Z

Hi all,

Would it be possible to get an H9Nx Nextclade build up and running based on the recently published H9 nomenclature paper? The paper includes reference datasets (see Appendices 2 and 3) from GISAID/NCBI for all lineages and for specific sublineages.

https://wwwnc.cdc.gov/eid/article/30/8/23-1176_article

Best,
Jurre Siegers

ivan-aksamentov · 2024-11-18T07:31:09Z

Hi @jurresiegers

There's been some discussion in this topic: #870 (comment)

This could change, but currently I am not aware of any concrete plans on the Nextstrain team to prepare datasets on this particular topic.

Community contributions are very welcome! Dataset author documentation is here: https://github.com/nextstrain/nextclade_data

jurresiegers · 2024-11-18T07:35:28Z

Thanks Ivan! I will follow up on that topic :)

ivan-aksamentov · 2024-11-18T07:38:24Z

@jurresiegers I think it's better to continue here, because that issue was for a different reason and also it is closed. I'll invite people from there to here.

AMPByrne · 2024-11-18T17:38:44Z

@ivan-aksamentov and @jurresiegers I've now got a working dataset but have only been able to test on around 300 sequences. Is there any guidance on what's considered adequate testing before submitting datasets?

ivan-aksamentov · 2024-11-18T19:21:12Z

@AMPByrne There are no particular established criteria - every virus is different.

You could submit a pull request to the data repo, and also give a link to your source repo, where you prepare the dataset, so that other people could test the dataset(s) as well. And then the community can decide if it's any good. And if not, they could suggest improvements. They could also comment in your source repo and submit proposals or fixes there.

The points usually discussed in these situations are the choice of reference sequence, the sampling of sequences for the reference tree, the QC config, how to subdivide datasets if there are multiple distant strains, dataset (path) naming, etc.

lmoncla · 2024-11-20T22:16:05Z

@AMPByrne this generally sounds great, and I am happy for you to take the lead on the H9 dataset if you are so inclined, since you are already doing it! We are about to put a manuscript describing our approach for the H5 datasets on bioRxiv, but would be happy to share it with you via email if you'd like to see what we did. We found that the clade calls tend to be better with more data, so I'd suggest maximizing the number of sequences you include. We also wanted to make sure that we were assigning things according to the clades established by WHO/FAO/WOAH, so we acquired a reference set from them, identified clade-defining nodes, and then tested the performance of Nextclade calls against LABEL using all the H5 data that we maintain for Nextstrain purposes (about 20,000 sequences that were not in the reference set). That was our general approach, and we plan to maintain these H5 datasets, continually working to improve them and keep them up to date. We are generally happy to help or collaborate with you, though our current bandwidth is a bit limited, so we may not be able to work on this directly in the next couple of months.
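For anyone wanting to run a similar check on H9, here is a minimal sketch of comparing two sets of clade calls (for example, the pre-assigned lineages from the paper against Nextclade output). It assumes both have already been loaded into plain `{sequence_name: clade}` dictionaries; the function name `concordance` and the toy sequence names and clade labels are hypothetical and only there for illustration.

```python
from collections import Counter


def concordance(expected, observed):
    """Compare two {sequence_name: clade} mappings.

    Returns the fraction of shared sequences with identical calls and a
    Counter of (expected_clade, observed_clade) pairs that disagree.
    """
    shared = expected.keys() & observed.keys()
    if not shared:
        raise ValueError("no sequence names in common")
    mismatches = Counter(
        (expected[name], observed[name])
        for name in shared
        if expected[name] != observed[name]
    )
    agreement = 1 - sum(mismatches.values()) / len(shared)
    return agreement, mismatches


# Toy data: labels are placeholders, not real H9 lineage names.
expected_calls = {"seq1": "clade_a", "seq2": "clade_a", "seq3": "clade_b", "seq4": "clade_c"}
observed_calls = {"seq1": "clade_a", "seq2": "clade_b", "seq3": "clade_b", "seq4": "clade_c"}

agreement, mismatches = concordance(expected_calls, observed_calls)
print(f"agreement: {agreement:.0%}")
for (exp, obs), count in mismatches.most_common():
    print(f"expected {exp}, got {obs}: {count}")
```

Listing the mismatched pairs, rather than only the overall percentage, makes it easier to see which clades are being confused with each other.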

AMPByrne · 2024-11-29T17:26:02Z

@lmoncla thanks for the guidance, that's really helpful! Fortunately, the manuscript describing the new lineage system provides pilot datasets for building the phylogenetic trees used to assign subclades, as well as a full dataset of pre-assigned sequences. I based my initial Nextclade dataset on the pilot tree dataset only, but I think I'm running into a similar issue to the one you saw, and need to increase the number of sequences in the Nextclade dataset to improve accuracy. Did you find that determining the clade-defining nodes improved accuracy over using phylogenetic placement in Nextclade for the H5 dataset? If you are able to share your manuscript, that would be really helpful, although it may already be on bioRxiv by now?

rneher · 2024-12-03T21:22:05Z

Hi @AMPByrne, in the H5 case, I believe the main problem is this: if the diversity within a clade is not well represented in the tree, and that diversity is large compared to the distance to other clades (short branches defining the clade), then placement of sequences can be unreliable. But I think that would be a problem with any method that assigns clades. Happy to help if that would be useful.
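As a rough illustration of the condition described above, the sketch below compares the mean pairwise distance within one clade to the mean distance between that clade and another. It uses raw Hamming distance on already-aligned, equal-length sequences as a crude proxy; this is an assumption made for the example and is not how Nextclade places sequences. All function names and the toy sequences are hypothetical.

```python
from itertools import combinations, product
from statistics import mean


def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_within(seqs):
    """Mean pairwise distance among sequences of a single clade."""
    pairs = list(combinations(seqs, 2))
    return mean(hamming(a, b) for a, b in pairs) if pairs else 0.0


def mean_between(seqs_a, seqs_b):
    """Mean distance between sequences of two different clades."""
    return mean(hamming(a, b) for a, b in product(seqs_a, seqs_b))


# Toy alignment: clade_x is diverse, clade_y sits close to one of its members.
clade_x = ["AAAAAAAA", "AAAATTTT", "TTTTAAAA"]
clade_y = ["AAAAAAAC"]

within = mean_within(clade_x)
between = mean_between(clade_x, clade_y)
if within >= between:
    print("within-clade diversity is not small relative to the distance to the other clade")
```

When the within-clade value is comparable to or larger than the between-clade value, a query sequence can sit closer to a neighbouring clade's reference than to its own, which is the situation where adding more reference sequences is likely to help.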

lmoncla · 2024-12-03T21:43:10Z

@AMPByrne if you shoot me an email ([email protected]), I'll send over our paper draft!

Richard correctly summarized the issue with not having enough sequences and the clade placement.

jurresiegers added the labels good first issue, help wanted, needs triage, t:feat · Nov 18, 2024

ivan-aksamentov removed the labels good first issue, help wanted, needs triage · Nov 18, 2024

ivan-aksamentov mentioned this issue in "ENH: Avian flu datasets, e.g. H5" #870 (closed) · Nov 18, 2024
