-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ingest #10
Add ingest #10
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This workflow ran for me without issue, @kimandrews. I added a couple of minor inline comments. Looking at these metadata and sequences for the first time, some thoughts I had were:
- We're throwing out a lot of data by filtering to 5000 bp sequences. It looks like the majority of sequences represent one or two genes (H and sometimes F, e.g., https://www.ncbi.nlm.nih.gov/nuccore/KJ183860.1). We should consider eventually building gene trees to use those smaller sequences.
- There are 6 samples with a host annotation that is not human (e.g., https://www.ncbi.nlm.nih.gov/nuccore/MG954083.1) and over 4000 without a host annotation at all. None of these samples have sequences that are long enough to get included in the current tree, but we should consider whether we want this workflow to produce human-specific measles trees or multi-host trees especially if we end up making gene trees. Since it's possible that the records without host annotations are not from human hosts, we would want a human-specific workflow to apply an additional host filter in the phylogenetic workflow.
A bigger issue I noticed when trying to use the outputs of this workflow for the phylogenetic workflow is that the use of the strain
column as the metadata id column leads to duplicated records when running augur filter
and causes the phylogenetic workflow to crash. I expect we'll need ingest to use the accession
as the strain
column, but that's probably something you've discussed with @j23414 and @joverlee521 already?
|
||
# Nextstrain AWS S3 Bucket with pathogen prefix | ||
# Replace <pathogen> with the pathogen repo name. | ||
s3_dst: "s3://nextstrain-data/files/workflows/<pathogen>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line needs to be updated to refer to measles
instead of <pathogen>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, the ingest/profiles directory has been removed for now (601edee) and will be added as a later change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nice to see this come along @kimandrews ! I'm really glad the zika examples that @j23414 worked through was helpful!
As we discussed in person, I think we can reduce confusion by removing the config parameters and rules that will not be used now. This includes the Entrez route for fetching sequences and the Nextclade related rules in rules/nextclade.smk
.
Overall, we are going down the route of using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Kim! Really useful for me to read through what a fresh ingest implementation using the new template repo looks like
ingest/config/defaults.yaml
Outdated
@@ -0,0 +1,117 @@ | |||
# This configuration file should contain all required configuration parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[a general comment on the config, or perhaps the forthcoming tutorial, with the intention of discussion rather than specific changes]
Discussing this with Kim today most of the values were copied over from Zika. This includes the extracted NCBI fields (as thus, as discussed in another comment thread below, we're missing some important fields). It also includes the curate fields, which are necessarily data-dependent (e.g. expected date fields). Is the best-practice to fill in this config field-by-field after inspecting the raw NCBI data, or copy over from another repo and then asses the output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had set up the template for the first route to force users to become familiar with the raw NCBI data and make conscious decisions to fill in the config. However, looking at this now, it might be too steep of a learning for new users.
Adding default values was also requested in nextstrain/pathogen-repo-guide#4 and Kim also provided feedback that it was easier to be able to run and assess the output before making changes.
Seems like the template should just include the suggested default config params and users can then change them as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the template should just include the suggested default config params and users can then change them as needed.
👍 I guess the tutorial is the best place to talk through how one can asses the data to make the config specific to their data? Another option is to start with ~no curation and use that outcome to inform what curation steps to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good discussion! I went ahead and created an pathogen-repo-guide Issue to further discuss a ~no curation option: nextstrain/pathogen-repo-guide#30
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kim & I briefly discussed what the point of normalizing these names is today. The majority of sequences don't have a strain name (and thus in our current curate usage we are using the accession). For those with actual strain names, these rules (taken from fauna + zika) aren't producing a nice uniform set of names as there's just too much variability. Briefly scanning through the data there is no coherent formatting to (non-accession) strain names; for instance here's a few examples:
FR_DF_16_1944
V2484482
up/2014_536_T
TOG_PLA_OGO_13_202
MV/India_Assam/7586AD/2018/D8
MKAK/MEP/2016/1883
MO_2891
MeaNS_sample_ID_55662
zhongshan20140428
B95a
One option is to ignore strain name manipulation (and not replace empty strain names with accessions), use accession as the canonical ID for both Augur and Auspice, and then include strain names as extra metadata in the Auspice JSON for the nodes where it exists. Combined with nextstrain/augur#1383 and Auspice's ability to search across all metadata I think this would make for a nice outcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strain name manipulation has been removed, and empty strain names are no longer replaced (56bfad6)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
For posterities sake, one thing I didn't quite realise is that those strain names above were being "corrected", e.g.
original: up/2014-536-T
curated: up/2014_536_T
which is another reason to skip name curation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From slack thread, deciding to not replace empty strain names with accessions
may have downstream effects in the phylogenetic workflow (e.g. If we display strain name on tips, then Auspice will generate a random string for the name that is displayed in the tree...which is undesirable behaviour.
From @joverlee521 on the same thread:
I think we can leave it as-is and just display the accession as the sequence name in the tree.
There's WIP to allow Auspice to display arbitrary fields as the name so we can then toggle between accession and strain name in the future.
@kimandrews, up-to-you if you want to revert c9dfc64 or leave it. We can also revisit this later, during your phylogenetic PR.
Prompted by nextstrain/measles#10 (comment) These have become common commands used across Nextstrain pathogen repos¹ since they were added to the Docker runtime² ¹ https://github.com/search?q=org%3Anextstrain+AND+%28%22csvtk%22+OR+%22tsv-select%22+OR+%22seqkit%22%29&type=code ² nextstrain/docker-base@992c575
f0a5194
to
0e39ced
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working through the first round of feedback @kimandrews!
I've left a couple more suggestions to remove more sections of the workflow that are not being used.
ingest/rules/fetch_from_ncbi.smk
Outdated
2. Fetch from Entrez (https://www.ncbi.nlm.nih.gov/books/NBK25501/) | ||
- requires `entrez_search_term` config | ||
- Returns all available data via a GenBank file | ||
- Requires a custom script to parse the necessary fields from the GenBank file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section of the docstring can be removed since this workflow does not include Entrez rules anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change was made in cc13eb6
|
||
This produces a `results` directory with the following outputs: | ||
- sequences.fasta | ||
- metadata.tsv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also includes all_metadata.tsv
.
In phylo workflows we have {data,results,auspice}
and it's understood that the files in results
can be numerous and change frequently with workflow updates. For ingest we only have {data,results}
. My understanding is that more complex ingest workflows will populate results
with many files. So maybe we could change the wording here to indicate that of the files in results
these two are the ones that should be used for downstream analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the all_metadata.tsv
looks like an intermediate data file, not the final result metadata.tsv
.
@kimandrews, you can change the path from results/all_metadata.tsv
to data/all_metadata.tsv
so there's less confusion on what the final output files should be (e.g. results/sequences.fasta
and results/metadata.tsv
.
measles/ingest/rules/curate.smk
Line 64 in 90a70cd
metadata="results/all_metadata.tsv", |
measles/ingest/rules/curate.smk
Lines 119 to 123 in 90a70cd
rule subset_metadata: | |
input: | |
metadata="results/all_metadata.tsv", | |
output: | |
subset_metadata="results/metadata.tsv", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in f3fe0c1
# the input as NDJSON records from stdin and output NDJSON records to stdout. | ||
# The final step of the pipeline should convert the NDJSON records to two | ||
# separate files: a metadata TSV and a sequences FASTA. | ||
rule curate: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider the following an idea to implement later, it doesn't have to be part of this PR
The virus name column contains information the phylo build will certainly want to show. It may even be the default colouring in auspice (or may be used as the basis for nextclade clade assignment which will then become the default colouring). The structure of values here is diverse, but the majority follow a straightforward pattern. Using cat results/metadata.tsv | cut -f 3 | sort | uniq -c | sort -r
the top 10 are:
6586 Measles morbillivirus
6557 Measles virus genotype D8
4883 Measles virus genotype B3
1546 Measles virus genotype D4
1341 Measles virus genotype H1
668 Measles virus genotype D9
158 Measles virus genotype H1a
137 Measles virus genotype A
131 Measles virus genotype D5
72 Measles virus genotype B2
I don't think it'd be too hard to write a small script to extract as many genotypes as possible from this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[also non-blocking comment]
I concur with @jameshadfield regarding the inclusion of genotype as a user-desired feature to display, and a first step for creating a nextclade dataset. (exciting! :D ) You can open an issue titled "Parse genotype from NCBI data." Once initiated, you can proceed by creating a distinct branch and subsequent pull request to maintain a focused scope (and focused review process :) ).
To address this task, you have the option to either develop a new script from the ground up or to adapt existing scripts such as fix-zika-strain-names.py or vendored/transform-authors. These scripts already read in from ndjson, but you'd have to modify it to parse virus-name
(or virus_name
) and infer a new "genotype" field, and invoke it somewhere in the curate rule. I'm happy to discuss this when you get to this task or if you have any questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened an issue for this here
ingest/config/defaults.yaml
Outdated
isolate-lineage: strain | ||
geo-region: region | ||
geo-location: location | ||
isolate-collection-date: date |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are many more sample collection dates (or dates which we can effectively use as sample collection dates) beyond this field.
For strain names following the WHO schema, the "XX.YY" corresponds to
Date of rash onset if known, otherwise date of specimen collection by epidemiological week (1-53) and year.
So, e.g., Accession JX187583
, which doesn't have a date in our workflow, has sample name MVs/Parma.ITA/47.08
in nuccore so it'd have date of 2008-11-XX | 2008-12-XX
Confusingly, looking at the raw NCBI dataset output that sample name does not appear¹, so there seems to be no way to extract it using the datasets
command. I'm not an expert on the new datasets CLI (entrez, for all its messiness, is what I use), but I'd think it's worth working out what's going on here. I've cross posted this to slack.
¹ Using the unzipped output from rule fetch_ncbi_dataset_package
, then grep 'JX187583' data/ncbi_dataset/ncbi_dataset/data/data_report.jsonl | gron
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing to the WHO schema for measles strain names! I've put in a request to the NCBI datasets team to add the strain
field. We'll see if it's possible on their in end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the meantime, we have many records that do have a strain name that are missing dates.
So I'd think regardless of whether we use datasets
or Entrez, it'd be worth writing a custom script to parse potential dates out of strain names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Similar to my comment above I'd think it fine to defer for a later PR if @kimandrews wanted to merge this and get the phylo part working with this new ingest data. Your call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback and clarifications on how measles data is being reported in GenBank.
Of the current samples that meet our minimum length requirement (5000bp; n=937) and do not have dates listed (n=157), only 3 samples have strain names from which the sampling date could be inferred.
I'm weighing the following options for moving forward:
- Write custom code that will recover sampling dates for the 3 samples
- Wait for NCBI Datasets to include the strain attribute, then write custom code that identifies which column the strain name is in (i.e. either the "strain" or "isolate" column) and subsequently uses strain name to infer sampling dates for samples with missing dates
- Don't use NCBI Datasets; write custom code to parse the GenBank records
- Only use samples that have sampling dates in the "date" column (don't try to infer missing dates from the strain name)
- Other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As reported by the datasets
program. I haven't created any custom code for parsing GenBank records yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend writing this up as an issue to tackle either later on or simultaneously with the phylo build. I think all 4 options could be comfortably considered out-of-scope for this PR. It seems that 'strain' will be parsed by NCBI sometime this year, and I don't it's unreasonable to wait for that, which gives a fair bit of time to think through the parsing approach and how it fits into the project. I don't know much about the temporal signal in measles phylogenetics, so I'd think to explore that as part of the phylo build in order to gauge how important filling in these missing data could be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Number of samples with strain names might change when we change the minimum length filters for gene specific builds in #13.
However, I also definitely support pushing this into a later issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, without the minimum length filter, there are definitely more samples with strain names that could be used to recover empty dates. Seems worthwhile to eventually get this sorted out in a later PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created an issue about creating custom code to parse measles strain names
Copy the ingest directory from pathogen-repo-template: https://github.com/nextstrain/pathogen-repo-template/tree/b8ae886b25877a218ad50380fb44f8825d50aedb/ingest The `ingest/vendored` subdirectory is not copied over since that folder should be added with `git-subrepo`. Future commits will change this to work with measles data.
…/vendored subrepo: subdir: "ingest/vendored" merged: "a0faef5" upstream: origin: "https://github.com/nextstrain/ingest" branch: "main" commit: "a0faef5" git-subrepo: version: "0.4.6" origin: "https://github.com/ingydotnet/git-subrepo" commit: "110b9eb"
Add taxon id and other config parameters related to the curate pipeline. Remove config parameters related to Nextclade because we do not currently have a Nextclade measles dataset.
Harmonizing with modifications to the [pathogen-repo-guide](nextstrain/pathogen-repo-guide@14e5dca)
Remove nextclade-related if-else statements from Snakefile. This also requires editing final rule in curate.smk to output results/metadata.tsv instead of results/subset_metadata.tsv
cc13eb6
to
90a70cd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left one final comment to address for the vendored subrepo, but the other changes look good to me.
subrepo: subdir: "ingest/vendored" merged: "e83c214" upstream: origin: "https://github.com/nextstrain/ingest" branch: "main" commit: "e83c214" git-subrepo: version: "0.4.6" origin: "https://github.com/ingydotnet/git-subrepo" commit: "110b9eb"
Follow up to review in the measles PR nextstrain/measles#10 (comment)
Change the output folder for all_metadata.tsv from ./results to ./data Follow up to [#10 (comment)](#10 (comment))
Description of proposed changes
Add ingest folder from pathogen-repo-template and make measles-specific modifications, following these steps: