merge: Support sequences #1579

tsibley · 2024-08-15T18:30:34Z

augur merge should support input --sequences and --output-sequences.

Features:

Sequences with same id are coalesced right-to-left, as with metadata rows.
Duplication handling within FASTA file? Compare to duplication handling within metadata file (i.e. no dups allowed).

Prior art:

ncov deduplicates sequences via scripts/sanitize_sequences.py using a rule that's only called when multiple inputs are involved (paralleling how scripts/combine_metadata.py is used).

See previous discussion in the PR which introduced augur merge.

Design discussion points

How to handle/require named inputs? (ref)
- Names should be optional.
- If provided, there should be cross-checking with named metadata.
How to handle segment-specific sequence files? (ref)

Possible solutions

The text was updated successfully, but these errors were encountered:

tsibley · 2024-08-21T21:00:08Z

Assigning to @victorlin per planning doc commentary.

victorlin · 2024-08-29T22:04:28Z

The proposed workaround:

augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
cat a.fasta b.fasta > all.fasta
augur filter --metadata c.tsv --sequences all.fasta --output-sequences c.fasta

does not work in the case of an entry existing in both a and b. augur merge will de-duplicate while cat keeps both entries, so augur filter would not properly handle (and should error on) the duplicates in all.fasta.

A proper workaround would be using seqkit rmdup in place of cat:

augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
seqkit rmdup b.fasta a.fasta > c.fasta
augur filter --metadata c.tsv --sequences c.fasta --output-metadata merged.tsv --output-sequences merged.fasta

noting that the order of b.fasta a.fasta is necessary to keep consistent with augur merge behavior of coalescing right-to-left. The purpose of augur filter is then to remove any entries present in sequences that are missing from metadata (and vice versa).

The workaround is still not great since the metadata and sequence files are loosely coupled. augur merge could more tightly couple this by requiring the NAME=FILE input format. Take these hypothetical user errors and warnings as an example:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences b=b.fasta a=a.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta
# ERROR: Order of inputs differs between metadata (a,b) and sequences (b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b.fasta c=c.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta
# ERROR: Sequence file (c.fasta) does not have a corresponding metadata file.

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b.fasta \
  --output-metadata c.tsv
  --output-sequences c.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

tsibley · 2024-08-31T01:18:31Z

Hmm. I'm not sure about these example errors. I know the original thinking was to pair metadata and sequences via the names given, but upon further reflection, I'm not sure we should require it.

# ERROR: Order of inputs differs between metadata (a,b) and sequences (b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.

Why require a matched order? That seems unfriendly to me, like asking the user to do unnecessary tedium for the computer's sake. I'd think to either

a. require named sequence inputs and make the order of --metadata be the authoritative one regardless of order in --sequences, or
b. don't require named sequence inputs and make the order of --sequences be the order to merge sequences, regardless of the order of --metadata. Names, if given, would allow for stricter warnings/validation of the metadata matching sequences.

Of the two, (b) would be better, I think.

If (a) and we want to support invocations without --metadata (i.e. --sequences only), then the order given to --sequences can matter in that specific case. For (b), this isn't an issue.

# ERROR: Sequence file (c.fasta) does not have a corresponding metadata file.

It seems to me to be reasonable/likely to have two or more paired metadata/sequence files, plus an extra file or two of sequences (e.g. corrected sequences) that don't have or need their own separate metadata. With strictly named pairs of metadata/sequence files, that wouldn't be possible and a stub metadata file would have to be fabricated for the sequences to avoid this error.

# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

These warnings seem good to me in general, but may need tweaking if behaviour (a) or (b) above is chosen instead.

victorlin · 2024-09-03T20:00:24Z

Before debating requirement of a matched order we should first settle on whether to require named sequence inputs.

The only way to have these warnings

# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

is if the different inputs could be paired together (e.g. via named sequence inputs). Without this pairing, there is not much benefit to allowing --metadata and --sequences in the same command. I think at that point, it would be more clear if separate commands are used to signify that no pairing/validation is happening:

augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
augur merge --sequences b.fasta a.fasta --output-sequences c.fasta

How about (c) require named sequence inputs when used with metadata? I'll provide an example below.

It seems to me to be reasonable/likely to have two or more paired metadata/sequence files, plus an extra file or two of sequences (e.g. corrected sequences) that don't have or need their own separate metadata.

This is reasonable, thanks for providing the example. As an extension: suppose there are datasets A (a.tsv, a.fasta) and B (b.tsv, b.fasta, b_corrected.fasta). We could support all in one command:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a.fasta b.fasta b_corrected.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta

but that wouldn't do any validation between metadata/sequences. It might as well be separate commands:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --output-metadata merged.tsv
augur merge \
  --sequences a.fasta b.fasta b_corrected.fasta \
  --output-sequences merged.fasta

(c) would allow for this:

# Create single FASTA file for B
augur merge \
  --sequences b.fasta b_corrected.fasta \
  --output-sequences b_corrected_merged.fasta

# Merge with paired validation
augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b_corrected_merged.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b_corrected_merged.fasta is missing from b.csv. It will not be present in any output.

tsibley · 2024-09-03T20:26:53Z

Some specific thoughts below, but the big picture is that I think we'll want to gather more insight into existing use cases and feedback from potential users on the behaviour that's most useful here.

The only way to have these warnings […] is if the different inputs could be paired together (e.g. via named sequence inputs).

Couldn't we have the warnings list all files if not paired by name?

# WARNING: Sequence `XXX` in a.tsv is missing from all sequence files.
# WARNING: Sequence `YYY` in b.fasta is missing from all metadata tables.

Or allow optional pairing by name, and multiple sequence inputs per name, to enable the warnings, but not require naming?

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b.fasta b=b_corrected.fasta cross_input_corrected.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta

Separately, when your example error says, "It will not be present in any output", does that mean missing sequences will filter out metadata records? I'm not sure that's what we want for augur merge.

It might as well be separate commands:

Except that as separate commands it'll take up more (potentially much more) transient disk space that's not otherwise required.

victorlin · 2024-09-03T22:10:02Z

Couldn't we have the warnings list all files if not paired by name?
Or allow optional pairing by name, and multiple sequence inputs per name, to enable the warnings, but not require naming?

Sure, both of those seem reasonable.

Separately, when your example error says, "It will not be present in any output", does that mean missing sequences will filter out metadata records? I'm not sure that's what we want for augur merge.

Yes. That's how augur filter works currently, and I'm assuming we want to stay consistent:

augur/tests/functional/filter/cram/subsample-max-sequences-with-probabilistic-sampling-warning.t

Lines 19 to 25 in f1d65fb

    
           10 strains were dropped during filtering 
        
           	1 had no metadata 
        
           	1 had no sequence data 
        
           	1 was dropped because it was earlier than 2012.0 or missing a date 
        
           	1 was dropped during grouping due to ambiguous month information 
        
           	6 were dropped because of subsampling criteria, using seed 314159 
        
           3 strains passed all filters

as separate commands it'll take up more (potentially much more) transient disk space that's not otherwise required.

If augur merge --sequences a.fasta b.fasta is a wrapper for seqkit rmdup b.fasta a.fasta, I don't see any overhead with separate commands. But this could change with implementation.

the big picture is that I think we'll want to gather more insight into existing use cases and feedback from potential users on the behaviour that's most useful here.

👍 good discussion of possibilities so far. Some potential questions for feedback:

Is it beneficial to validate that an entry is present in both files of an input pair?
Example: warn if an entry in a.tsv is missing from a.fasta
Is it beneficial to aggregate input types and validate that an entry is present in both types?
Example: warn if an entry in a.tsv + b.tsv is missing from a.fasta + b.fasta

In this example, would it be beneficial or confusing if SEQ2 in c.tsv came from b.tsv while in c.fasta it came from a.fasta?

# a: SEQ1, SEQ2
# b:       SEQ2, SEQ3

augur merge \
  --metadata a=a.tsv b=b.tsv \
  --sequences b=b.fasta a=a.fasta \
  --output-metadata c.tsv
  --output-sequences c.fasta

tsibley · 2024-09-04T06:36:10Z

Yes. That's how augur filter works currently, and I'm assuming we want to stay consistent:

Yeah, I know that's how augur filter works. I guess I think of merging as something separate than a filter step here. It'd be weird IMO for data to be filtered out by augur merge.

If augur merge --sequences a.fasta b.fasta is a wrapper for seqkit rmdup b.fasta a.fasta, I don't see any overhead with separate commands. But this could change with implementation.

Ah, I meant in this example of yours:

# Create single FASTA file for B
augur merge \
  --sequences b.fasta b_corrected.fasta \
  --output-sequences b_corrected_merged.fasta

# Merge with paired validation
augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b_corrected_merged.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta

where b_corrected_merged.fasta is the extra disk space needed.

victorlin · 2024-09-04T17:09:01Z

Yeah, I know that's how augur filter works. I guess I think of merging as something separate than a filter step here. It'd be weird IMO for data to be filtered out by augur merge.

That's a good point. My mind was stuck in augur filter land but it sounds reasonable to keep filtering out of augur merge. In that case, is there any point in having paired validation between input types? If not, the simplest approach would be to implement metadata and sequence merge support to be mutually exclusive. This would avoid the complexities that arise when allowing them to be used together such as which order to coalesce and the extent of paired validation.

In the future, if paired validation becomes necessary in workflows, we can consider adding support for both metadata+sequences in a single command as an additional feature.

victorlin · 2024-09-16T19:05:38Z

To summarize, there are two different approaches:

(1) is much simpler to implement: the bulk of it is an alias to seqkit rmdup.

(2) needs work to first define how much cross-checking to do (this was discussed above). Depending on the amount of cross-checking, it may be necessary to:

split the metadata merging logic into database import / merge output
read the sequence IDs into a database table
add validation across metadata and sequence tables

The prototypes incorporate all of the above and should be functional enough to help decide which approach we want to take, at least initially.

victorlin · 2024-09-25T22:07:04Z

Thinking through this again, we could allow all these scenarios in the same implementation:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences c.fasta b.fasta a.fasta
# WARNING: Sequence inputs are unnamed. Skipping validation between metadata and sequences.

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences c=c.fasta b=b.fasta a=a.fasta
# ERROR: Order of inputs differs between metadata (a,b) and sequences (c,b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.
# Alternatively, use unnamed sequence inputs to skip validation between
# metadata and sequences.

augur merge \
  --metadata a=a.tsv b=b.csv c=c.tsv \
  --sequences a=a.fasta b=b.fasta c=c.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

I'll gather feedback on this interface in Slack.

victorlin · 2024-12-16T17:27:07Z

@jameshadfield posted a good reminder about segment-specific sequence files in nextstrain/zika#76 (comment). Copy/pasted below to continue discussion.

This seems clear-cut for unsegmented pathogens, whereby we have a single augur merge command which collects all n metadata TSVs & m sequence FASTAs into a single TSV & FASTA.

For segmented genomes our analyses use a single TSV and 1 FASTA per segment. So the metadata includes samples which may be in only one FASTA (e.g. only one segment was sequenced). AFAIK we haven't got prototypes for how to merge multiple inputs for segmented pathogens beyond my prototype in avian-flu which handles metadata merging and sequence merging independently. It's not clear what's the best approach here.

We could continue to treat things independently. When augur merge is given both sequences & metadata does it do more than drop rows?

Should augur merge be aware of segments? e.g. you could specify something like --sequences usvi=seg1=<path> usvi=seg2=<path>.

victorlin · 2024-12-16T17:28:40Z

When augur merge is given both sequences & metadata does it do more than drop rows?

Yes, it will

I'm assuming "treat things independently" means something like this:

augur merge
  --metadata
      ingest=ingest.tsv
      usvi=usvi.tsv
  --sequences
      ingest=ingest-seg1.fasta
      usvi=usvi-seg1.fasta
  --output-metadata
      merged.tsv
  --output-sequences
      merged-seg1.fasta

augur merge
  --metadata
      ingest=ingest.tsv
      usvi=usvi.tsv
  --sequences
      ingest=ingest-seg2.fasta
      usvi=usvi-seg2.fasta
  --output-metadata
      merged.tsv
  --output-sequences
      merged-seg2.fasta

This should work just fine, but not optimally since the same metadata would be loaded into SQLite multiple times (and looks redundant to the user).

Should augur merge be aware of segments?

I think that's the only way to have a more optimal implementation. Instead of the usvi=seg1=<path> syntax, another option that keeps the existing name=file syntax would be to allow arbitrary --[output-]sequences-*:

augur merge
  --metadata
      ingest=ingest.tsv
      usvi=usvi.tsv
  --sequences-seg1
      ingest=ingest-seg1.fasta
      usvi=usvi-seg1.fasta
  --sequences-seg2
      ingest=ingest-seg2.fasta
      usvi=usvi-seg2.fasta
  --output-metadata
      merged.tsv
  --output-sequences-seg1
      merged-seg1.fasta
  --output-sequences-seg2
      merged-seg2.fasta

This retains the 1:many relationship between metadata and sequence files, but deviates from the 1:1 relationship in augur filter that's reflected in the current implementation. Maybe whatever we settle on here should be used in augur filter too?

jameshadfield · 2024-12-16T20:09:43Z

I'm assuming "treat things independently" means something like this [...]

Not quite. As per the linked example I mean something like:

augur merge --metadata A=A.tsv B=B.tsv --output-metadata merged.tsv

seqkit rmdup A.seg1.fasta B.seg1.fasta > merged.seg1.fasta

seqkit rmdup A.seg2.fasta B.seg2.fasta > merged.seg2.fasta

In your first example (with two augur merge commands), if there is a strain which is in seg1 but not seg2 I would have thought this is dropped from the resulting merged.tsv. Testing 81b22d7 shows this is not the case. I didn't expect this, even if it is convenient in this case.

Apart from the (nice) sanity checking of command line arguments, I'm not quite understanding the difference between a joint merge of metadata + sequences vs independent merging. I'll make a reminder to circle back here in the new year and revisit.

tsibley added the enhancement New feature or request label Aug 15, 2024

tsibley self-assigned this Aug 15, 2024

victorlin mentioned this issue Aug 15, 2024

augur merge #1563

Merged

7 tasks

tsibley assigned victorlin and unassigned tsibley Aug 21, 2024

This was referenced Aug 27, 2024

merge: Support sequences with cross-checking #1601

Open

Subcommand for data de-duplication #919

Open

victorlin mentioned this issue Sep 16, 2024

merge: Support sequences without cross-checking #1631

Closed

6 tasks

victorlin mentioned this issue Nov 15, 2024

Provide a generic pattern for including additional user data alongside curated data nextstrain/pathogen-repo-guide#72

Open

jameshadfield mentioned this issue Nov 24, 2024

work needed to allow users to run analyses separate to pathogen workflows nextstrain/pathogen-repo-guide#74

Open

5 tasks

victorlin mentioned this issue Dec 16, 2024

Use augur merge for sequences nextstrain/zika#76

Draft

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

merge: Support sequences #1579

merge: Support sequences #1579

tsibley commented Aug 15, 2024 •

edited by victorlin

Loading

tsibley commented Aug 21, 2024

victorlin commented Aug 29, 2024 •

edited

Loading

tsibley commented Aug 31, 2024

victorlin commented Sep 3, 2024

tsibley commented Sep 3, 2024

victorlin commented Sep 3, 2024

tsibley commented Sep 4, 2024

victorlin commented Sep 4, 2024

victorlin commented Sep 16, 2024

victorlin commented Sep 25, 2024 •

edited

Loading

victorlin commented Dec 16, 2024

victorlin commented Dec 16, 2024

jameshadfield commented Dec 16, 2024

merge: Support sequences #1579

merge: Support sequences #1579

Comments

tsibley commented Aug 15, 2024 • edited by victorlin Loading

Design discussion points

Possible solutions

tsibley commented Aug 21, 2024

victorlin commented Aug 29, 2024 • edited Loading

tsibley commented Aug 31, 2024

victorlin commented Sep 3, 2024

tsibley commented Sep 3, 2024

victorlin commented Sep 3, 2024

tsibley commented Sep 4, 2024

victorlin commented Sep 4, 2024

victorlin commented Sep 16, 2024

victorlin commented Sep 25, 2024 • edited Loading

victorlin commented Dec 16, 2024

victorlin commented Dec 16, 2024

jameshadfield commented Dec 16, 2024

tsibley commented Aug 15, 2024 •

edited by victorlin

Loading

victorlin commented Aug 29, 2024 •

edited

Loading

victorlin commented Sep 25, 2024 •

edited

Loading