Provide a no-curation option for NCBI Ingest #30

Closed
j23414 opened this issue Feb 11, 2024 · 1 comment · Fixed by #38
Labels
enhancement New feature or request

Comments

@j23414
Contributor

j23414 commented Feb 11, 2024

Context

Inspired by a review comment on the measles/add-ingest pull request, I propose an optional rule that skips (or bypasses) the curation steps and produces raw output. That output can then guide the choice of specific curation steps or metadata field transformations.

This idea is loosely connected to the effort of standardizing NCBI field transformations, as discussed in #20.

Description

Allow users to run ingest so that it produces an uncurated metadata.tsv file, which they can compare against the curated metadata.tsv file. Users can then decide which curation steps to opt in to, or choose to use our defaults.
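
To illustrate the comparison described above, here is a minimal sketch that reports which columns appear only in the uncurated file, only in the curated file, or in both. The helper names, column names, and sample records are made up for illustration and are not taken from the workflow.

```python
import csv
import io


def read_columns(tsv_text):
    """Return the header row of a TSV document as a list of column names."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return next(reader)


def compare_columns(uncurated_tsv, curated_tsv):
    """Report which columns are unique to each file and which are shared."""
    raw = read_columns(uncurated_tsv)
    curated = read_columns(curated_tsv)
    return {
        "only_uncurated": [c for c in raw if c not in curated],
        "only_curated": [c for c in curated if c not in raw],
        "shared": [c for c in raw if c in curated],
    }


# Hypothetical stand-ins for the uncurated and curated metadata.tsv files.
uncurated = "accession\tlocation\tdate\nXX000001\tUSA: Iowa\t2020-01\n"
curated = "accession\tcountry\tdivision\tdate\nXX000001\tUSA\tIowa\t2020-01-XX\n"

report = compare_columns(uncurated, curated)
for group, columns in report.items():
    print(f"{group}: {', '.join(columns)}")
```

In practice the two strings would be replaced by the contents of the real files, for example via `open(path).read()`.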

Examples

Possible solution

One potential approach is to use the "custom_rules" config in the Snakefile. Users can then import "no-curate" rules that take data/ncbi.ndjson, pass it through augur curate passthru, and generate an uncurated metadata.tsv file.
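
As a rough picture of what such a "no-curate" rule would produce, the sketch below writes NDJSON records to TSV without transforming any field. It is a plain-Python stand-in, not augur's implementation and not the proposed rule itself; the function name and sample records are hypothetical.

```python
import csv
import io
import json


def ndjson_to_tsv(ndjson_text):
    """Write every NDJSON record to TSV unchanged, one column per key."""
    records = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

    # Collect field names in first-seen order so that no field is dropped.
    fields = []
    for record in records:
        for key in record:
            if key not in fields:
                fields.append(key)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fields, delimiter="\t", lineterminator="\n", restval=""
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Hypothetical records standing in for the contents of data/ncbi.ndjson.
sample = (
    '{"accession": "XX000001", "location": "USA: Iowa"}\n'
    '{"accession": "XX000002", "host": "Homo sapiens"}\n'
)
tsv = ndjson_to_tsv(sample)
print(tsv, end="")
```

Records that lack a field get an empty cell, so the output keeps every field seen in the input.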

@j23414 j23414 added the enhancement New feature or request label Feb 11, 2024
@joverlee521
Contributor

I'm not sure we would need custom rules here...

It should be enough to document the NCBI Datasets output target data/ncbi_dataset_report.tsv so that users can inspect the raw data there.
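
One way to inspect the raw data is to count how many records have a non-empty value in each column, which hints at the fields worth keeping. This is a minimal sketch; the function name, column names, and sample rows are illustrative assumptions, not the actual contents of data/ncbi_dataset_report.tsv.

```python
import csv
import io


def column_fill_counts(tsv_text):
    """Count the records that have a non-empty value in each TSV column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            # Skip overflow cells (name is None) and missing or blank values.
            if name is not None and value and value.strip():
                counts[name] += 1
    return counts


# Hypothetical rows standing in for the NCBI Datasets report.
report_tsv = (
    "accession\tlocation\thost\n"
    "XX000001\tUSA: Iowa\t\n"
    "XX000002\t\t\n"
)
fill = column_fill_counts(report_tsv)
for name, count in fill.items():
    print(f"{name}\t{count}")
```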

joverlee521 added a commit that referenced this issue Mar 28, 2024
Provides an easy way for first time users to get the uncurated metadata
from NCBI Datasets commands by running the ingest workflow with the
specified target `data/ncbi_dataset_report.tsv`.

Afterwards, users can easily remove fields that are not needed as part
of the workflow to reduce the file size and save space.

Prompted by @jameshadfield in review of the tutorial¹ and
resolves #30.

¹ nextstrain/docs.nextstrain.org#195 (comment)
joverlee521 added a commit that referenced this issue Mar 29, 2024
Provides an easy way for first time users to get the full
uncurated metadata from NCBI Datasets commands by running the
ingest workflow with the specified target `dump_ncbi_dataset_report`.
They can then inspect and explore the raw data to determine if they
want to configure the workflow to use additional fields from NCBI.

The rule is added to `fetch_from_ncbi.smk` to make it easy to run
without additional configs. Note that it is not run as part of the
default workflow and only intended to be used as a specified target.

Prompted by @jameshadfield in review of the tutorial¹ and
resolves #30.

¹ nextstrain/docs.nextstrain.org#195 (comment)

Co-authored-by: James Hadfield <[email protected]>