Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ingest #10
Add ingest #10
Changes from all commits
866abd6
8555df3
e830f65
375d9ec
9f6cc30
7e10c4b
07416e6
0fca27e
9ebfebc
9afc28f
c9dfc64
d2737ed
6976bd4
3fb3e9e
90a70cd
da3c325
e766152
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also includes
all_metadata.tsv
.In phylo workflows we have
{data,results,auspice}
and it's understood that the files inresults
can be numerous and change frequently with workflow updates. For ingest we only have{data,results}
. My understanding is that more complex ingest workflows will populateresults
with many files. So maybe we could change the wording here to indicate that of the files inresults
these two are the ones that should be used for downstream analysis.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the
all_metadata.tsv
looks like an intermediate data file, not the final resultmetadata.tsv
.@kimandrews, you can change the path from
results/all_metadata.tsv
todata/all_metadata.tsv
so there's less confusion on what the final output files should be (e.g.results/sequences.fasta
andresults/metadata.tsv
.measles/ingest/rules/curate.smk
Line 64 in 90a70cd
measles/ingest/rules/curate.smk
Lines 119 to 123 in 90a70cd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in f3fe0c1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider the following an idea to implement later, it doesn't have to be part of this PR
The virus name column contains information the phylo build will certainly want to show. It may even be the default colouring in auspice (or may be used as the basis for nextclade clade assignment which will then become the default colouring). The structure of values here is diverse, but the majority follow a straightforward pattern. Using
cat results/metadata.tsv | cut -f 3 | sort | uniq -c | sort -r
the top 10 are:I don't think it'd be too hard to write a small script to extract as many genotypes as possible from this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[also non-blocking comment]
I concur with @jameshadfield regarding the inclusion of genotype as a user-desired feature to display, and a first step for creating a nextclade dataset. (exciting! :D ) You can open an issue titled "Parse genotype from NCBI data." Once initiated, you can proceed by creating a distinct branch and subsequent pull request to maintain a focused scope (and focused review process :) ).
To address this task, you have the option to either develop a new script from the ground up or to adapt existing scripts such as fix-zika-strain-names.py or vendored/transform-authors. These scripts already read in from ndjson, but you'd have to modify it to parse
virus-name
(orvirus_name
) and infer a new "genotype" field, and invoke it somewhere in the curate rule. I'm happy to discuss this when you get to this task or if you have any questions.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened an issue for this here