merge: Support sequences #1579
Comments
Assigning to @victorlin per planning doc commentary. |
The proposed workaround:
augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
cat a.fasta b.fasta > all.fasta
augur filter --metadata c.tsv --sequences all.fasta --output-sequences c.fasta
does not work in the case of an entry existing in both
A proper workaround would be using
augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
seqkit rmdup b.fasta a.fasta > c.fasta
augur filter --metadata c.tsv --sequences c.fasta --output-metadata merged.tsv --output-sequences merged.fasta
noting that the order of
The workaround is still not great since the metadata and sequence files are loosely coupled.
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences b=b.fasta a=a.fasta \
--output-metadata merged.tsv \
--output-sequences merged.fasta
# ERROR: Order of inputs differs between metadata (a,b) and sequences (b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences a=a.fasta b=b.fasta c=c.fasta \
--output-metadata merged.tsv \
--output-sequences merged.fasta
# ERROR: Sequence file (c.fasta) does not have a corresponding metadata file.
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences a=a.fasta b=b.fasta \
--output-metadata c.tsv \
--output-sequences c.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output. |
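To illustrate why the input order matters in the seqkit workaround above, here is a minimal Python sketch of first-occurrence deduplication by sequence ID. It assumes that duplicates are detected by ID and that the first record seen is kept, which is why b.fasta is listed before a.fasta to make entries from b take priority. The helper names (parse_fasta, keep_first) and the sample records are hypothetical and are not part of augur or seqkit.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (sequence ID, sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (id, sequence) pairs from FASTA-formatted text."""
    seq_id = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The ID is taken as the first whitespace-delimited token of the header.
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def keep_first(records: Iterable[Record]) -> Iterator[Record]:
    """Drop records whose ID was already seen, keeping the first occurrence."""
    seen = set()
    for seq_id, seq in records:
        if seq_id not in seen:
            seen.add(seq_id)
            yield seq_id, seq


# Hypothetical inputs: s2 exists in both files with different sequences.
a_fasta = ">s1\nAAAA\n>s2\nCCCC\n"
b_fasta = ">s2\nGGGG\n>s3\nTTTT\n"

# b is read first, so b's version of s2 is the one that survives.
merged = dict(keep_first(list(parse_fasta(b_fasta)) + list(parse_fasta(a_fasta))))
print(merged)
```

Reading the inputs in the opposite order would keep a's version of s2 instead, which would disagree with metadata merged as a=a.tsv b=b.csv, where the last input wins.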
Hmm. I'm not sure about these example errors. I know the original thinking was to pair metadata and sequences via the names given, but upon further reflection, I'm not sure we should require it.
Why require a matched order? That seems unfriendly to me, like asking the user to do unnecessary tedium for the computer's sake. I'd think to either
a. require named sequence inputs and make the order of
Of the two, (b) would be better, I think. If (a) and we want to support invocations without
It seems to me to be reasonable/likely to have two or more paired metadata/sequence files, plus an extra file or two of sequences (e.g. corrected sequences) that don't have or need their own separate metadata. With strictly named pairs of metadata/sequence files, that wouldn't be possible and a stub metadata file would have to be fabricated for the sequences to avoid this error.
These warnings seem good to me in general, but may need tweaking if behaviour (a) or (b) above is chosen instead. |
Before debating requirement of a matched order we should first settle on whether to require named sequence inputs. The only way to have these warnings is if the different inputs could be paired together (e.g. via named sequence inputs). Without this pairing, there is not much benefit to allowing
augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
augur merge --sequences b.fasta a.fasta --output-sequences c.fasta
How about (c) require named sequence inputs when used with metadata? I'll provide an example below.
This is reasonable, thanks for providing the example. As an extension: suppose there are datasets A (
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences a.fasta b.fasta b_corrected.fasta \
--output-metadata merged.tsv \
--output-sequences merged.fasta
but that wouldn't do any validation between metadata/sequences. It might as well be separate commands:
augur merge \
--metadata a=a.tsv b=b.csv \
--output-metadata merged.tsv
augur merge \
--sequences a.fasta b.fasta b_corrected.fasta \
--output-sequences merged.fasta
(c) would allow for this:
# Create single FASTA file for B
augur merge \
--sequences b.fasta b_corrected.fasta \
--output-sequences b_corrected_merged.fasta
# Merge with paired validation
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences a=a.fasta b=b_corrected_merged.fasta \
--output-metadata merged.tsv \
--output-sequences merged.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b_corrected_merged.fasta is missing from b.csv. It will not be present in any output. |
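The paired validation behind the warnings above can be sketched in Python as a set comparison per named input. This is only an illustration of the proposed behaviour, assuming that an ID missing from either side is dropped from all outputs, as the warning text says. The function name check_pair and the sample IDs are hypothetical and do not reflect augur's actual implementation.

```python
from typing import List, Set, Tuple


def check_pair(
    metadata_file: str,
    metadata_ids: Set[str],
    sequence_file: str,
    sequence_ids: Set[str],
) -> Tuple[Set[str], List[str]]:
    """Compare IDs of one named metadata/sequence pair.

    Returns the IDs present on both sides and one warning per unmatched ID.
    """
    warnings = []
    # IDs with metadata but no sequence.
    for seq_id in sorted(metadata_ids - sequence_ids):
        warnings.append(
            f"Sequence `{seq_id}` in {metadata_file} is missing from {sequence_file}. "
            "It will not be present in any output."
        )
    # IDs with a sequence but no metadata.
    for seq_id in sorted(sequence_ids - metadata_ids):
        warnings.append(
            f"Sequence `{seq_id}` in {sequence_file} is missing from {metadata_file}. "
            "It will not be present in any output."
        )
    return metadata_ids & sequence_ids, warnings


# Hypothetical IDs for the pair named "a".
kept, warnings = check_pair("a.tsv", {"s1", "XXX"}, "a.fasta", {"s1", "YYY"})
for message in warnings:
    print(f"WARNING: {message}")
```

Whether unmatched metadata records should really be dropped is a design decision for this issue; the sketch only mirrors the wording of the example warnings.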
Some specific thoughts below, but the big picture is that I think we'll want to gather more insight into existing use cases and feedback from potential users on the behaviour that's most useful here.
Couldn't we have the warnings list all files if not paired by name?
Or allow optional pairing by name, and multiple sequence inputs per name, to enable the warnings, but not require naming?
Separately, when your example error says, "It will not be present in any output", does that mean missing sequences will filter out metadata records? I'm not sure that's what we want for
Except that as separate commands it'll take up more (potentially much more) transient disk space that's not otherwise required. |
Sure, both of those seem reasonable.
Yes. That's how augur/tests/functional/filter/cram/subsample-max-sequences-with-probabilistic-sampling-warning.t Lines 19 to 25 in f1d65fb
If
👍 good discussion of possibilities so far. Some potential questions for feedback:
|
Yeah, I know that's how
Ah, I meant in this example of yours:
where |
That's a good point. My mind was stuck in In the future, if paired validation becomes necessary in workflows, we can consider adding support for both metadata+sequences in a single command as an additional feature. |
To summarize, there are two different approaches:
(1) is much simpler to implement: the bulk of it is an alias to
(2) needs work to first define how much cross-checking to do (this was discussed above). Depending on the amount of cross-checking, it may be necessary to:
The prototypes incorporate all of the above and should be functional enough to help decide which approach we want to take, at least initially. |
Thinking through this again, we could allow all these scenarios in the same implementation:
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences c.fasta b.fasta a.fasta
# WARNING: Sequence inputs are unnamed. Skipping validation between metadata and sequences.
augur merge \
--metadata a=a.tsv b=b.csv \
--sequences c=c.fasta b=b.fasta a=a.fasta
# ERROR: Order of inputs differs between metadata (a,b) and sequences (c,b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.
# Alternatively, use unnamed sequence inputs to skip validation between
# metadata and sequences.
augur merge \
--metadata a=a.tsv b=b.csv c=c.tsv \
--sequences a=a.fasta b=b.fasta c=c.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.
I'll gather feedback on this interface in Slack. |
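As a rough illustration of how the three scenarios above could share one code path, here is a minimal Python sketch of the argument check. It assumes inputs are given as NAME=FILE or as a bare FILE, and that any unnamed sequence input disables validation; the mixed named/unnamed case is not discussed in this thread, so that part is an assumption. The function names split_named and check_inputs are hypothetical.

```python
from typing import List, Optional, Tuple


def split_named(arg: str) -> Tuple[Optional[str], str]:
    """Split 'name=path' into (name, path); a bare path has no name."""
    name, sep, path = arg.partition("=")
    return (name, path) if sep else (None, arg)


def check_inputs(metadata_args: List[str], sequence_args: List[str]) -> List[str]:
    """Return warnings for the given inputs, or raise ValueError on a mismatch."""
    metadata_names = [split_named(arg)[0] for arg in metadata_args]
    sequence_names = [split_named(arg)[0] for arg in sequence_args]

    # Assumption: a single unnamed sequence input is enough to skip validation.
    if any(name is None for name in sequence_names):
        return [
            "Sequence inputs are unnamed. "
            "Skipping validation between metadata and sequences."
        ]

    # Named inputs must appear in the same order on both sides.
    if metadata_names != sequence_names:
        raise ValueError(
            "Order of inputs differs between metadata "
            f"({','.join(map(str, metadata_names))}) and sequences "
            f"({','.join(map(str, sequence_names))})."
        )
    return []


# Scenario 1: unnamed sequences, validation is skipped with a warning.
print(check_inputs(["a=a.tsv", "b=b.csv"], ["c.fasta", "b.fasta", "a.fasta"]))

# Scenario 3: matching names and order, per-record validation can proceed.
print(check_inputs(["a=a.tsv", "b=b.csv", "c=c.tsv"],
                   ["a=a.fasta", "b=b.fasta", "c=c.fasta"]))
```

Scenario 2 (named inputs in a different order) raises ValueError in this sketch, which corresponds to the ERROR message in the second example.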
@jameshadfield posted a good reminder about segment-specific sequence files in nextstrain/zika#76 (comment). Copy/pasted below to continue discussion.
This seems clear-cut for unsegmented pathogens, whereby we have a single
For segmented genomes our analyses use a single TSV and 1 FASTA per segment. So the metadata includes samples which may be in only one FASTA (e.g. only one segment was sequenced). AFAIK we haven't got prototypes for how to merge multiple inputs for segmented pathogens beyond my prototype in avian-flu which handles metadata merging and sequence merging independently. It's not clear what's the best approach here. We could continue to treat things independently.
When Should |
Yes, it will I'm assuming "treat things independently" means something like this:
This should work just fine, but not optimally since the same metadata would be loaded into SQLite multiple times (and looks redundant to the user).
I think that's the only way to have a more optimal implementation. Instead of the
This retains the 1:many relationship between metadata and sequence files, but deviates from the 1:1 relationship in |
Not quite. As per the linked example I mean something like:
augur merge --metadata A=A.tsv B=B.tsv --output-metadata merged.tsv
seqkit rmdup A.seg1.fasta B.seg1.fasta > merged.seg1.fasta
seqkit rmdup A.seg2.fasta B.seg2.fasta > merged.seg2.fasta
In your first example (with two
Apart from the (nice) sanity checking of command line arguments, I'm not quite understanding the difference between a joint merge of metadata + sequences vs independent merging.
I'll make a reminder to circle back here in the new year and revisit. |
augur merge should support input --sequences and --output-sequences.
Features:
Prior art:
scripts/sanitize_sequences.py using a rule that's only called when multiple inputs are involved (paralleling how scripts/combine_metadata.py is used).
See previous discussion in the PR which introduced augur merge.
Design discussion points
Possible solutions