H9Nx (all lineages), H9Nx (Y-lineage), H9Nx (G-lineage), H9Nx (B-lineage) #1552

Open
jurresiegers opened this issue Nov 18, 2024 · 9 comments
Labels
t:feat Type: request of a new feature, functionality, enhancement

Comments

@jurresiegers

jurresiegers commented Nov 18, 2024

Hi all,

Would it be possible to get an H9Nx Nextclade build up and running based on the recently published H9 nomenclature paper? The paper includes reference datasets (see Appendices 2 and 3) from GISAID/NCBI for all lineages and for specific sublineages.

https://wwwnc.cdc.gov/eid/article/30/8/23-1176_article

Best,
Jurre Siegers

jurresiegers added the labels good first issue, help wanted, needs triage, and t:feat on Nov 18, 2024
@ivan-aksamentov
Member

ivan-aksamentov commented Nov 18, 2024

Hi @jurresiegers

There's been some discussion on this topic: #870 (comment)

This could change, but currently I am not aware of any concrete plans on the Nextstrain team to prepare datasets on this particular topic.

Community contributions are very welcome! Dataset author documentation is here: https://github.com/nextstrain/nextclade_data

ivan-aksamentov removed the labels good first issue, help wanted, and needs triage on Nov 18, 2024
@jurresiegers
Author

Thanks Ivan! I will follow up on that topic :)

@ivan-aksamentov
Member

@jurresiegers I think it's better to continue here, because that issue was opened for a different reason and is now closed. I'll invite people from there to join here.

@AMPByrne

@ivan-aksamentov and @jurresiegers I've now got a working dataset but have only been able to test on around 300 sequences. Is there any guidance on what's considered adequate testing before submitting datasets?

@ivan-aksamentov
Member

ivan-aksamentov commented Nov 18, 2024

@AMPByrne There are no particular established criteria - every virus is different.

You could submit a pull request to the data repo, and also give a link to your source repo, where you prepare the dataset, so that other people could test the dataset(s) as well. And then the community can decide if it's any good. And if not, they could suggest improvements. They could also comment in your source repo and submit proposals or fixes there.

The usual points discussed in these situations are the choice of reference sequence, the sampling of sequences for the reference tree, the QC config, how to subdivide datasets if there are multiple distant strains, dataset (path) naming, etc.

@lmoncla

lmoncla commented Nov 20, 2024

@AMPByrne this generally sounds great, and I am happy for you to take the lead on the H9 dataset if you are so inclined, since you are already doing it! We are about to put a manuscript describing our approach for the H5 datasets on bioRxiv, but would be happy to share it with you via email if you'd like to see what we did. We found that the clade calls tend to be better with more data, so I'd suggest maximizing the number of sequences you include. We also wanted to make sure that we were assigning things according to the clades established by WHO/FAO/WOAH, so we acquired a reference set from them, identified clade-defining nodes, and then tested the performance of Nextclade calls against LABEL using all the H5 data that we maintain for Nextstrain purposes (about 20,000 sequences that were not in the reference set). That was our general approach, and we plan to maintain these H5 datasets, continually improve them, and keep them up to date. Generally, we are happy to help/collaborate with you, though our current bandwidth is a bit limited, so we may not be able to work on this directly in the next couple of months.
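
For anyone wanting to run a similar check on an H9 dataset, the comparison of Nextclade calls against a second set of assignments (LABEL in the H5 case, or the pre-assigned sequences from the nomenclature paper) could be tallied roughly as below. This is only a minimal sketch, not the H5 team's actual pipeline: the function name, the in-memory dictionaries, and the clade names `Y1`, `G2`, and `B1` are made-up placeholders, and in practice both mappings would be read from the tools' own output files.

```python
from collections import Counter


def clade_concordance(reference_calls, test_calls):
    """Compare two {sequence_name: clade} mappings on the sequences they share.

    Returns the overall agreement, the agreement per reference clade,
    and a list of (name, expected, observed) tuples for disagreements.
    """
    shared = sorted(set(reference_calls) & set(test_calls))
    per_clade_total = Counter()
    per_clade_match = Counter()
    mismatches = []
    for name in shared:
        expected = reference_calls[name]
        observed = test_calls[name]
        per_clade_total[expected] += 1
        if expected == observed:
            per_clade_match[expected] += 1
        else:
            mismatches.append((name, expected, observed))
    overall = sum(per_clade_match.values()) / len(shared) if shared else 0.0
    per_clade = {c: per_clade_match[c] / n for c, n in per_clade_total.items()}
    return overall, per_clade, mismatches


# Toy data with hypothetical sequence and clade names (not real H9 lineages).
label_calls = {"seqA": "Y1", "seqB": "Y1", "seqC": "G2", "seqD": "B1"}
nextclade_calls = {"seqA": "Y1", "seqB": "G2", "seqC": "G2", "seqD": "B1"}

overall, per_clade, mismatches = clade_concordance(label_calls, nextclade_calls)
print(f"overall agreement: {overall:.2f}")
for clade, fraction in sorted(per_clade.items()):
    print(f"{clade}: {fraction:.2f}")
for name, expected, observed in mismatches:
    print(f"{name}: expected {expected}, got {observed}")
```

Reporting agreement per clade as well as overall makes it easier to spot the specific clades where the reference tree needs more sequences.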

@AMPByrne

@lmoncla thanks for the guidance, that's really helpful! Fortunately, the manuscript describing the new lineage system provides pilot datasets for building the phylogenetic trees used to assign subclades, as well as a whole dataset of pre-assigned sequences. I based my initial Nextclade dataset on the pilot tree dataset alone, but I think I'm running into a similar issue to the one you saw, and need to increase the number of sequences in the Nextclade dataset to improve accuracy. Did you find that determining the clade-defining nodes improved accuracy over using phylogenetic placement in Nextclade for the H5 dataset? If you are able to share your manuscript that would be really helpful, although it may already be on bioRxiv by now?

@rneher
Member

rneher commented Dec 3, 2024

Hi @AMPByrne, in the H5 case, I believe the main problem is that if the diversity within a clade is not well represented in the tree, and that diversity is large compared to the distance to other clades (short branches defining the clade), then placement of sequences can be unreliable. But I think that would be a problem with any method that assigns clades. Happy to help if that would be useful.
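
One rough way to look for clades at risk of this problem is to compare the spread of sequences inside each clade with the distance to the nearest sequence of any other clade. The sketch below is an illustration under strong assumptions, not what Nextclade does internally: it uses plain Hamming distance on pre-aligned, equal-length sequences as a stand-in for tree distances, and the function names, clade names, and toy sequences are all hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of differing positions between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def clade_separation(aligned_by_clade):
    """Map each clade to (max within-clade distance, min distance to another clade).

    The second value is None when there is no other clade to compare against.
    """
    report = {}
    for clade, seqs in aligned_by_clade.items():
        within = max((hamming(a, b) for a, b in combinations(seqs, 2)), default=0)
        between = min(
            (
                hamming(a, b)
                for other, other_seqs in aligned_by_clade.items()
                if other != clade
                for a in seqs
                for b in other_seqs
            ),
            default=None,
        )
        report[clade] = (within, between)
    return report


# Toy alignment: clade_1 is internally diverse but sits close to clade_2.
toy_alignment = {
    "clade_1": ["AAAAAAAA", "AAAATTTT"],
    "clade_2": ["AAAAAAAC", "AAAAAACC"],
}

report = clade_separation(toy_alignment)
# Flag clades whose internal spread exceeds their separation from other clades.
at_risk = [c for c, (w, b) in report.items() if b is not None and w > b]
print(report)
print(at_risk)
```

Clades flagged this way are candidates for denser sampling in the reference tree, which is consistent with the advice above to include as many sequences as possible.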

@lmoncla

lmoncla commented Dec 3, 2024

@AMPByrne if you shoot me an email ([email protected]), I'll send over our paper draft!

Richard correctly summarized the issue with not having enough sequences and the clade placement.


5 participants