ENH: Add display name after sequence id in downloads, or allow customization of fasta id, so it can contain metadata or display name #2284
Comments
Hmm, but it probably makes sense for sequences to be keyed by accession in LAPIS. What would Nextclade do with: For me that's probably what we want to aim towards - the accession as the real
I made a LAPIS issue
It's worth checking that this won't break some commonly used (but stupid) programs... a lot of them cannot handle anything with spaces or even slightly unusual characters. Like IQTree possibly? While that's technically their problem, it's a big pain for users to have to strip down fasta names before running such tools (as we do in Nextstrain...)
Also, some programs will assume the whole fasta header matches metadata column X (to match records between metadata and fasta files), so users may also run into trouble there.
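For anyone who hits either of the two problems above, here is a minimal sketch of the "strip down fasta names" step. It assumes the ID is everything before the first space, tab or `|` in the header; the function name and the separator set are hypothetical and not taken from Loculus, LAPIS or Nextstrain.

```python
def strip_fasta_descriptions(fasta_text: str, separators: str = " \t|") -> str:
    """Reduce every FASTA header to its leading ID token.

    Assumption: the ID is whatever precedes the first separator character.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:]
            # Find the earliest separator in the header, if there is one.
            cut = len(header)
            for sep in separators:
                pos = header.find(sep)
                if pos != -1:
                    cut = min(cut, pos)
            out.append(">" + header[:cut])
        else:
            # Sequence lines pass through unchanged.
            out.append(line)
    return "\n".join(out) + "\n"
```

After this step the header is just the accession again, so it can be matched against an accession column in the metadata file.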
Ideally this could be something configurable by the user - to pick how they want their download.
IQtree processed the sequences properly (i.e. discarding the descriptions). Given this is the default GenBank format I would expect most well-engineered tools to be able to cope with it, but I agree there will be some exceptions (including lots of my janky scripts). The LAPIS implementation we've been discussing most recently would allow customisation.
Thanks for the research, @theosanderson. As an aside, the NCBI Virus default is not great: the title is usually very uninformative. That's one reason I'm excited for Loculus to do this better 😀
Just so we're on the same page about the implementation of this, my understanding at the moment is: SILO/LAPIS doesn't support this. If we want to have it anyway, we can no longer rely on the direct download from there. Instead we'd have to build our own download endpoint in the backend, which either proxies the LAPIS one (downloading from there, modifying the files, zipping them again and sending them to the user) or is built from scratch.
A potential (temporary) solution could also be what @theosanderson did here, where he added a new (external) service to join the data: #3439
I've implemented something here: an 80/20 solution, not streaming, as perfect is the enemy of the good. Some thoughts here: #3448 (comment) It would still be nice if LAPIS implemented that feature. It's the obvious place to do this. It's a bit silly that we have to download metadata and sequences from LAPIS separately, then join them on the website server before sending them on.
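To illustrate the join described above, a rough non-streaming sketch: read the metadata TSV into a dict, then rewrite each fasta header as `accession|displayName`. This is not the code from #3448. The column names `accession` and `displayName`, the function name and the `|` separator are all assumptions for the example.

```python
import csv
import io


def combine_metadata_and_fasta(
    metadata_tsv: str,
    fasta_text: str,
    id_column: str = "accession",      # assumed metadata column name
    name_column: str = "displayName",  # assumed metadata column name
    separator: str = "|",
) -> str:
    """Append the display name to each FASTA header, all in memory."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    names = {row[id_column]: row[name_column] for row in reader}

    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            accession = line[1:].strip()
            name = names.get(accession)
            # Leave the header untouched when no metadata row matches.
            out.append(f">{accession}{separator}{name}" if name else line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Everything is held in memory, which is the same trade-off as the non-streaming approach mentioned above: fine for modest downloads, not for very large ones.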
This is @theosanderson's single-file service (the rest is just boilerplate): https://github.com/theosanderson/metadata-sequence-combiner/blob/main/pages/api/combine.ts Interesting how easy it is to do that when you don't have zodios hooks and types to fight 😄 Might not work for segment selection though.
Blocked on LAPIS: GenSpectrum/LAPIS#857
I downloaded the most recent mpox sequences from loculus full-mpox to put them into Nextclade.
Within Nextclade, I noticed that all I got is the loculus accession, and I'm sorely missing metadata - i.e. what's currently in our display name.
Would be great if we could either always add the display name after a separator, e.g.
LOCACCESSION|DISPLAYNAME
Or allow the user to configure the fasta id - I guess that's hard if we use LAPIS as LAPIS probably doesn't have a concept of multiple or configurable sequence ids - unless I'm missing something.
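To make the "configurable fasta id" idea concrete, here is a small sketch where the user supplies a template and the header is built from one metadata record. The template syntax, the field names and the choice to replace whitespace with underscores (so tools that split on spaces keep the whole id) are assumptions for illustration, not something LAPIS or Loculus offers.

```python
import re


def format_fasta_id(record: dict, template: str = "{accession}|{displayName}") -> str:
    """Build a FASTA id from a metadata record using a user-supplied template.

    Raises KeyError if the template names a field the record does not have.
    """
    header = template.format_map(record)
    # Collapse any run of whitespace into one underscore.
    return re.sub(r"\s+", "_", header.strip())
```

With the default template this gives the `LOCACCESSION|DISPLAYNAME` form proposed above, and a template of `{accession}` gives back the current behaviour.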
Alternatives for searchability: customize adding metadata to fasta sequence headers