Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add display name after sequence id in downloads, or allow customization of fasta id, so it can contain metadata or display name #2284

Open
corneliusroemer opened this issue Jul 10, 2024 · 13 comments · May be fixed by #3448
Labels
discussion Open questions epic A major task that should be broken down into smaller tasks

Comments

@corneliusroemer
Copy link
Contributor

corneliusroemer commented Jul 10, 2024

Blocked on LAPIS: GenSpectrum/LAPIS#857

I downloaded the most recent mpox sequences from loculus full-mpox to put them into Nextclade.

Within Nextclade, I noticed that all I got is the loculus accession, and I'm sorely missing metadata - i.e. what's currently in our display name.

Would be great if we could either always add the display name after a separator, e.g. LOCACCESSION|DISPLAYNAME

Or allow the user to configure the fasta id - I guess that's hard if we use LAPIS as LAPIS probably doesn't have a concept of multiple or configurable sequence ids - unless I'm missing something.

Brave Browser 2024-07-10 16 47 21

Alternatives for searchability: customize adding metadata to fasta sequence headers

@corneliusroemer corneliusroemer changed the title ENH: Allow customization of fasta id, so it can contain metadata or display name ENH: Add display name after sequence id in downloads, or allow customization of fasta id, so it can contain metadata or display name Jul 10, 2024
@theosanderson
Copy link
Member

theosanderson commented Jul 10, 2024

Hmm, but it probably makes sense for sequences to be keyed by accession in LAPIS.

What would Nextclade do with:
image

For me that's probably what we want to aim towards - the accession as the real ID and then the display name as a description.

@corneliusroemer
Copy link
Contributor Author

corneliusroemer commented Jul 10, 2024

Indeed, Nextclade would just show the whole thing that follows >

We should add a fasta description, that's what I meant.

Brave Browser 2024-07-10 17 26 16 image

@theosanderson
Copy link
Member

I made a LAPIS issue

@emmahodcroft
Copy link
Member

emmahodcroft commented Jul 11, 2024

It's worth checking that this won't break some commonly used (but stupid) programs... a lot of them cannot handle anything with spaces or even with slightly weird character. Like IQTree possibly? While that's technically their problem it's a big pain for users to have to strip down fasta names before running such stuff (as we do in Nextstrain...)

@emmahodcroft
Copy link
Member

emmahodcroft commented Jul 11, 2024

Also, some programs will assume the fasta header (whole thing) matches metadata column X (to match between metadata & fasta files), so users may also run into trouble there.

@emmahodcroft
Copy link
Member

Ideally this could be something configurable by the user - to pick how they want their download.

@theosanderson
Copy link
Member

I was just testing out iqtree with NCBI Virus (haven't finished that yet) and they indeed have nice customisation:
image

@theosanderson
Copy link
Member

IQtree processed the sequences properly (i.e. discarding the descriptions). Given this is the default GenBank format I would expect most well-engineered tools to be able to cope with it but I agree there will be some exceptions (including lots of my janky scripts). The LAPIS implementation we've been discussing most recently would allow customisation.

@corneliusroemer
Copy link
Contributor Author

Thanks for the research @theosanderson

As an aside, the NCBI virus default is not great, the title is usually very uninformative - one reason I'm excited for Loculus to do this better 😀

@chaoran-chen chaoran-chen added this to the post-MVP milestone Jul 22, 2024
@chaoran-chen chaoran-chen added the epic A major task that should be broken down into smaller tasks label Jul 22, 2024
@chaoran-chen chaoran-chen moved this to Backlog in Planning Jul 22, 2024
@anna-parker anna-parker added the silo/LAPIS upstream silo/LAPIS issue label Aug 15, 2024
@corneliusroemer corneliusroemer moved this from Backlog to Prioritized in Planning Sep 9, 2024
@corneliusroemer corneliusroemer removed the silo/LAPIS upstream silo/LAPIS issue label Dec 9, 2024
@fhennig fhennig added the discussion Open questions label Dec 16, 2024
@fhennig
Copy link
Contributor

fhennig commented Dec 16, 2024

Just so we're on the same page about the implementation of this, my understanding at the moment is: SILO/LAPIS doesn't support this; if we want to have it anyways, we'll have to not rely on the direct download from there anymore, but instead build our own download endpoint in the backend, which either proxies the LAPIS one (downloading from there, modifying the files, zip again and send to user) or we build a new download endpoint from scratch.

@chaoran-chen
Copy link
Member

A potential (temporary) solution could also be what @theosanderson did here where he added a new (external) service to join the data: #3439

@corneliusroemer corneliusroemer linked a pull request Dec 17, 2024 that will close this issue
2 tasks
@corneliusroemer
Copy link
Contributor Author

I've implemented something here, 80/20 solution, not streaming as perfect is the enemy of the good. Some thoughts here: #3448 (comment)

Would still be nice if LAPIS implemented that feature. It's the obvious place to do this. It's a bit silly that we have to download metadata and sequences from LAPIS separately, then join them on the website server before sending them on.

@corneliusroemer
Copy link
Contributor Author

This is @theosanderson's single file service (the rest is just boilerplate): https://github.com/theosanderson/metadata-sequence-combiner/blob/main/pages/api/combine.ts

Interesting how easy it is to do that when you don't have zodios hooks and types to jump fight 😄 Might not work for segment selection though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Open questions epic A major task that should be broken down into smaller tasks
Projects
Status: Blocked
Development

Successfully merging a pull request may close this issue.

6 participants