From 9cb4f21c218c78f0fcbc78228ab9e3919d702bc3 Mon Sep 17 00:00:00 2001
From: Jover Lee
Date: Tue, 26 Nov 2024 15:23:26 -0800
Subject: [PATCH 1/4] data-formats: Add table of contents

Preparing to add a separate section for TSV files.
---
 src/reference/data-formats.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/reference/data-formats.rst b/src/reference/data-formats.rst
index 0f85b701..5c9a2065 100644
--- a/src/reference/data-formats.rst
+++ b/src/reference/data-formats.rst
@@ -2,6 +2,12 @@
 Data formats
 ============
 
+.. contents:: Table of Contents
+   :local:
+
+JSON
+====
+
 Nextstrain uses a few different kinds of `JSON `__ files at various stages in a typical build.

From 935365109486945ff296cfdd5aaee788485dd632 Mon Sep 17 00:00:00 2001
From: Jover Lee
Date: Tue, 26 Nov 2024 15:33:57 -0800
Subject: [PATCH 2/4] data-formats: Add TSV section

Proper handling of TSVs with `csvtk`/`tsv-utils` was originally
recommended by @tsibley
---
 src/reference/data-formats.rst | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/reference/data-formats.rst b/src/reference/data-formats.rst
index 5c9a2065..359201e2 100644
--- a/src/reference/data-formats.rst
+++ b/src/reference/data-formats.rst
@@ -5,6 +5,31 @@ Data formats
 .. contents:: Table of Contents
    :local:
 
+TSV
+===
+
+Nextstrain generally uses TSV files for metadata.
+Nextstrain tools and workflows produce `RFC 4180 CSV-like TSVs `__.
+
+When using `csvtk `__
+
+* the ``--lazy`` (``-l``) option should not be necessary
+* the ``fix-quotes``/``del-quotes`` commands should not be necessary
+
+When using `tsv-utils `__
+
+* pass the inputs through ``csv2tsv --csv-delim $'\t'``
+* pass the final ``tsv-util`` outputs through ``csvtk fix-quotes --tabs``
+
+.. code-block:: bash
+
+    csv2tsv --csv-delim $'\t' metadata.tsv \
+        | tsv-select -H -f strain,date \
+        | tsv-uniq -H -f strain \
+        | csvtk fix-quotes --tabs > output.tsv
+
+See our internal `discussion on TSV standardization `__ for more details.
+
 JSON
 ====
 

From c561b61cbabdf58da43be1197e71eddb8d86394c Mon Sep 17 00:00:00 2001
From: Jover Lee
Date: Wed, 27 Nov 2024 10:27:02 -0800
Subject: [PATCH 3/4] fetch-docs: Add User-Agent to session

Adding User-Agent as an attempt to fix the 403 Client Errors returned
from requests to GitHub.
---
 src/fetch-docs.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/fetch-docs.py b/src/fetch-docs.py
index 985212b0..57b1f296 100755
--- a/src/fetch-docs.py
+++ b/src/fetch-docs.py
@@ -27,6 +27,9 @@ if __name__ == '__main__':
 
     # Use a Session for connection pooling
     session = requests.Session()
+    session.headers.update({
+        "User-Agent": "https://github.com/nextstrain/docs.nextstrain.org (hello@nextstrain.org)",
+    })
 
     class RemoteDoc:
         def __init__(self, source_url, dest_path):

From a6d0a870ae1308b7b7d097b55540422e12300e2a Mon Sep 17 00:00:00 2001
From: Jover Lee
Date: Wed, 27 Nov 2024 13:17:44 -0800
Subject: [PATCH 4/4] data-formats: Strongly word preference for TSVs

Recommended by @genehack in review
---
 src/reference/data-formats.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/reference/data-formats.rst b/src/reference/data-formats.rst
index 359201e2..2db1329a 100644
--- a/src/reference/data-formats.rst
+++ b/src/reference/data-formats.rst
@@ -8,7 +8,9 @@ Data formats
 TSV
 ===
 
-Nextstrain generally uses TSV files for metadata.
+Nextstrain strongly prefers using TSV files for metadata even though Augur commands support other delimiters as inputs.
+If you are using other formats, we recommend using :doc:`augur curate passthru ` to convert them to TSV.
+
 Nextstrain tools and workflows produce `RFC 4180 CSV-like TSVs `__.
 
 When using `csvtk `__