Download subset of genome sequences for selected tree nodes #1110

biocyberman · 2020-05-06T21:25:28Z

Context
Similar to downloading a subset of metadata, we want to extract a subset of genome sequences for further analysis. This might not be helpful or allowed in the global nextstrain/auspice instance, but for our local one it is legal and a useful feature to have.

Description
This feature should work almost exactly like extracting a subset of metadata.

Possible solution

augur export needs to export and include genome sequences when the user asks for them (e.g. via an --export-genomes flag).
The genomes are saved in auspice's datasetDir, alongside the dataset files.
If auspice finds a file whose name ends with -sequences.json, it presents a "download subset of sequences" button in the Download Data popup window.

An observation: auspice removed handling of sequences.JSON at version 1.8.0. This feature is probably related to that sequence-handling code.


jameshadfield · 2020-05-06T23:28:58Z

This feature could be made part of auspice and made an "opt-out" extension (or opt-in) so that different implementations can choose whether or not to expose it. I know others have asked for it, so I think it would get used. Happy for someone to implement it, but it's not something we (nextstrain.org) can pursue currently.

Just thinking about it briefly, it would involve a new API call to fetch the sequences, subset them, and download them. Or you could post the subsetted strain list and ask for a matching sequences file from the server. There would be memory/speed considerations here, as sequence data can be very large and comes in different formats (VCF, FASTA, etc.). I don't think making a new JSON sequence format would be recommended.

Currently one can download a metadata TSV subsetted appropriately, which you could then use to get the sequences you want via a script (or a different web API etc). I appreciate that it may be nicer to do it all within auspice, but there may be easier short-term solutions.
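For anyone looking for that short-term route, here is a minimal Python sketch of the script approach. It assumes the downloaded metadata TSV has a column named "strain" and that each FASTA header starts with the strain name; both are assumptions, so adjust them to your files. The function names are only illustrative.

```python
import csv
from typing import Iterable, Iterator, Set


def read_strain_names(tsv_lines: Iterable[str], column: str = "strain") -> Set[str]:
    """Collect strain names from a tab-separated metadata table.

    The column name "strain" is an assumption about the TSV header.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    return {row[column] for row in reader if row.get(column)}


def filter_fasta(fasta_lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield only the FASTA records whose header name is in `wanted`.

    Works line by line, so a large FASTA file is never held in memory.
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The record name is the header text up to the first whitespace.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line
```

Both functions accept any iterable of lines, so open file handles can be passed in directly: read the strain names from the TSV first, then write the output of filter_fasta to a new file.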

> auspice removed handling of sequences.JSON at version 1.8.0

We used to rely on this to extract mutations to display genotypes as I remember (it was >2 years ago). It's on the horizon for us to implement fetching of one (ancestral) sequence which we need to colour the tree by a position which has no observed mutations. It will probably be in fasta format, but the details haven't been worked out. But this is separate from what's being asked in this issue.

biocyberman · 2020-05-07T07:13:18Z

Thanks for the comments.

> I know others have asked for it, so I think it would get used. Happy for someone to implement it, but it's not something we (nextstrain.org) can pursue currently.

Sounds like something worth pursuing. I can probably arrange some time to do this, depending on task priorities in the COVID-19 project I am working on.

> Currently one can download a metadata TSV subsetted appropriately, which you could then use to get the sequences you want via a script (or a different web API etc). I appreciate that it may be nicer to do it all within auspice, but there may be easier short-term solutions.

I already wrote a short bash script to do the extraction, but I agree: the auspice interface is more interactive and less intimidating.

biocyberman · 2020-05-09T10:18:44Z

@jameshadfield I made some progress hacking on the feature. I need your comments and guidance:

Since nt_muts.json from augur already contains the sequences, I am thinking of copying it into datasetDir. In the strainGenome function, I want to open a stream, filter the JSON file by strain name, and return a FASTA file. The problem is that I don't know how to expose datasetDir there, or whether doing so is a good idea. If it is not, where would be a better place to parse nt_muts.json at run time? To improve efficiency, maybe a lightweight database like tingodb would be better? For what it's worth, nt_muts.json can also be simplified to contain only names and sequences.
Alternatively, given a list of strain names, the function can spawn a shell process that runs a seqtk command, similar to what I did in the shell script, and return the file to the client. The same question applies to accessing the genomes.fasta file.
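To make the first option concrete, here is a rough Python sketch of the filtering step only. It is not the auspice server code. It assumes nt_muts.json has a top-level "nodes" object that maps each node name to an entry with a "sequence" string; check that against your own file. The function names and the 60-character line width are illustrative. It loads the whole JSON into memory, so it does not address the streaming or efficiency question.

```python
import json
from typing import Dict, Iterable, Iterator


def load_node_data(path: str) -> Dict:
    """Read a node-data JSON file (for example nt_muts.json) into a dict."""
    with open(path) as handle:
        return json.load(handle)


def node_data_to_fasta(node_data: Dict, strains: Iterable[str], width: int = 60) -> Iterator[str]:
    """Yield FASTA lines for the requested strains.

    Assumes node_data["nodes"][name]["sequence"] holds the full sequence.
    Strains that are missing or have no stored sequence are skipped.
    """
    nodes = node_data.get("nodes", {})
    for name in strains:
        sequence = nodes.get(name, {}).get("sequence")
        if not sequence:
            continue
        yield f">{name}\n"
        # Break the sequence into fixed-width lines.
        for start in range(0, len(sequence), width):
            yield sequence[start:start + width] + "\n"
```

With the selected strain names in a list, "".join(node_data_to_fasta(load_node_data("nt_muts.json"), names)) gives the FASTA text to send back.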

trvrb · 2024-10-29T18:43:13Z

Bumping this feature request. If we have root-sequence.json as sidecar or embedded in the primary Auspice JSON I believe that it should be possible for Auspice to reconstruct exactly the sequence for all tips in the tree. I'd propose that when clicking Download Data you'd get an option right below "Metadata (TSV)" that would read "Sequences (FASTA)". This download option would be disabled when data_provenance contains GISAID, just like metadata download is disabled.
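A minimal Python sketch of that reconstruction, to show the idea rather than how Auspice would implement it. It assumes each tree node lists its nucleotide changes under branch_attrs.mutations.nuc as strings such as "A123T" with 1-based positions, and that the root sequence is available as a plain string (for example the "nuc" entry of root-sequence.json). The function names are illustrative. Only what is recorded as mutations can be recovered.

```python
import re
from typing import Dict, Iterator, List, Tuple

# Ancestral state, 1-based position, derived state, e.g. "A123T" or "T45-".
MUTATION = re.compile(r"^([A-Za-z*-])(\d+)([A-Za-z*-])$")


def apply_mutations(sequence: List[str], mutations: List[str]) -> List[str]:
    """Return a copy of `sequence` with each mutation's derived state applied."""
    result = list(sequence)
    for mutation in mutations:
        match = MUTATION.match(mutation)
        if match is None:
            raise ValueError(f"unrecognised mutation: {mutation!r}")
        _, position, derived = match.groups()
        result[int(position) - 1] = derived
    return result


def tip_sequences(tree: Dict, root_sequence: str) -> Iterator[Tuple[str, str]]:
    """Walk the tree from the root and yield (tip name, sequence) pairs."""
    stack = [(tree, list(root_sequence))]
    while stack:
        node, parent_sequence = stack.pop()
        mutations = node.get("branch_attrs", {}).get("mutations", {}).get("nuc", [])
        sequence = apply_mutations(parent_sequence, mutations)
        children = node.get("children", [])
        if not children:
            yield node["name"], "".join(sequence)
        for child in children:
            stack.append((child, sequence))
```

Each node gets its own copy of the parent sequence, which is simple but memory-hungry for long genomes. A real implementation would probably apply and revert mutations in place during the traversal.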

I've encountered multiple people now where even just having .xz or .zst compression is proving a challenge to working with the data.

Having this option makes complete sense for NCBI analyses like https://nextstrain.org/rabies or https://nextstrain.org/oropouche. We may need to give an opt-in / opt-out option however as authors of datasets like https://nextstrain.org/groups/inrb-mpox/clade-I or https://nextstrain.org/community/inrb-drc/ebola-nord-kivu may prefer users to download data through GitHub, etc...

biocyberman added the enhancement New feature or request label May 6, 2020

biocyberman linked a pull request May 30, 2020 that will close this issue

Add opt-in genome download feature #1149

Open
