Add creating-an-ingest-workflow tutorial
Based on contents of the initial draft of the tutorial
https://docs.google.com/document/d/1_16VYT5MU8oXJ4t6HUHp_smx_kgF9OMsORSCldee1_0/edit#heading=h.r95jmyuit0s0

Split out the steps to get the uncurated NCBI Dataset data into a
snippet that can be shared between the two ingest tutorials.
joverlee521 committed Mar 22, 2024
1 parent 36f009e commit 59527e5
Showing 4 changed files with 482 additions and 13 deletions.
11 changes: 11 additions & 0 deletions src/snippets/uncurated-ncbi-dataset.rst
@@ -0,0 +1,11 @@
.. code-block::

    $ nextstrain shell .
    $ datasets download virus genome taxon <taxon-id> --filename ingest/data/ncbi_dataset.zip
    $ dataformat tsv virus-genome --package ingest/data/ncbi_dataset.zip > ingest/data/raw_metadata.tsv
    $ exit

The resulting ``ingest/data/raw_metadata.tsv`` will contain all of the fields available from NCBI Datasets.
Note that the headers in this file use the human-readable ``Name`` of the
`NCBI Datasets' available fields <https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields>`_,
while the pipeline uses the config's ``curate.field_map`` dictionary to convert these to the computer-friendly ``Mnemonic``.
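
To make the ``Name`` versus ``Mnemonic`` distinction concrete, here is a minimal Python sketch of renaming TSV headers with a dictionary. It is not the pipeline's implementation: the ``FIELD_MAP`` entries, the ``rename_headers`` helper, and the sample row are all hypothetical and chosen only for illustration, so check the real names against the NCBI Datasets field list before relying on them.

```python
import csv
import io

# Hypothetical subset of a field map: human-readable Name -> Mnemonic.
# These pairs are illustrative; verify them against the NCBI Datasets field list.
FIELD_MAP = {
    "Accession": "accession",
    "Isolate Collection date": "isolate-collection-date",
}


def rename_headers(tsv_text, field_map):
    """Return tsv_text with header names replaced according to field_map.

    Headers that are missing from field_map are left unchanged.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text
    rows[0] = [field_map.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# Made-up sample data standing in for ingest/data/raw_metadata.tsv.
sample = "Accession\tIsolate Collection date\tOther\nABC123\t2024-01-01\tx\n"
renamed = rename_headers(sample, FIELD_MAP)
print(renamed)
```

Leaving unmapped headers untouched is a deliberate choice in this sketch, so a column that is absent from the map is still visible in the output instead of being silently dropped.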