Add creating-an-ingest-workflow tutorial

Based on the contents of the initial draft of the tutorial: https://docs.google.com/document/d/1_16VYT5MU8oXJ4t6HUHp_smx_kgF9OMsORSCldee1_0/edit#heading=h.r95jmyuit0s0

Split out the steps to get the uncurated NCBI Datasets data into a snippet that can be shared between the two ingest tutorials.
nextstrain · Mar 22, 2024 · 59527e5
1 parent 36f009e
commit 59527e5

Showing 4 changed files with 482 additions and 13 deletions.
diff --git a/src/snippets/uncurated-ncbi-dataset.rst b/src/snippets/uncurated-ncbi-dataset.rst
@@ -0,0 +1,11 @@
+.. code-block::
+
+    $ nextstrain shell .
+    $ datasets download virus genome taxon <taxon-id> --filename ingest/data/ncbi_dataset.zip
+    $ dataformat tsv virus-genome --package ingest/data/ncbi_dataset.zip > ingest/data/raw_metadata.tsv
+    $ exit
+
+The resulting ``ingest/data/raw_metadata.tsv`` contains all of the fields available from NCBI Datasets.
+Note that the headers in this file use the human-readable ``Name`` of each of the
+`NCBI Datasets' available fields <https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields>`_,
+while the pipeline uses the config's ``curate.field_map`` dictionary to convert these to the computer-friendly ``Mnemonic``.
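
The header conversion described in the snippet can be illustrated with a minimal Python sketch. This is not the pipeline's actual implementation: the ``FIELD_MAP`` entries, the ``rename_headers`` helper, and the sample row are all hypothetical, chosen only to show how a dictionary like ``curate.field_map`` could translate human-readable column names into computer-friendly ones while leaving unmapped columns untouched.

```python
import csv
import io

# Hypothetical stand-in for the config's curate.field_map dictionary.
# These entries are illustrative assumptions, not the real configuration.
FIELD_MAP = {
    "Accession": "accession",
    "Isolate Collection date": "isolate-collection-date",
}


def rename_headers(tsv_text, field_map):
    """Return the TSV text with header names replaced according to field_map.

    Headers that do not appear in field_map are kept unchanged.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return ""
    # Only the first row (the header) is rewritten; data rows pass through.
    rows[0] = [field_map.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# Made-up sample data standing in for a few columns of raw_metadata.tsv.
sample = "Accession\tIsolate Collection date\tLength\nXX000001\t2020-01-01\t1000\n"
renamed = rename_headers(sample, FIELD_MAP)
print(renamed, end="")
```

In this sketch, ``Length`` passes through unchanged because it has no entry in the hypothetical map, which makes it easy to spot columns that still need a mapping.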