Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge: Support sequences #1579

Open
tsibley opened this issue Aug 15, 2024 · 13 comments · May be fixed by #1601
Open

merge: Support sequences #1579

tsibley opened this issue Aug 15, 2024 · 13 comments · May be fixed by #1601
Assignees
Labels
enhancement New feature or request

Comments

@tsibley
Copy link
Member

tsibley commented Aug 15, 2024

augur merge should support input --sequences and --output-sequences.

Features:

  • Sequences with same id are coalesced right-to-left, as with metadata rows.
  • Duplication handling within FASTA file? Compare to duplication handling within metadata file (i.e. no dups allowed).

Prior art:

See previous discussion in the PR which introduced augur merge.

Design discussion points

  • How to handle/require named inputs? (ref)
    • Names should be optional.
    • If provided, there should be cross-checking with named metadata.
  • How to handle segment-specific sequence files? (ref)

Possible solutions

  1. merge: Support sequences without cross-checking #1631
  2. merge: Support sequences with cross-checking #1601
@tsibley tsibley added the enhancement New feature or request label Aug 15, 2024
@tsibley tsibley self-assigned this Aug 15, 2024
@victorlin victorlin mentioned this issue Aug 15, 2024
7 tasks
@tsibley
Copy link
Member Author

tsibley commented Aug 21, 2024

Assigning to @victorlin per planning doc commentary.

@victorlin
Copy link
Member

victorlin commented Aug 29, 2024

The proposed workaround:

augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
cat a.fasta b.fasta > all.fasta
augur filter --metadata c.tsv --sequences all.fasta --output-sequences c.fasta

does not work in the case of an entry existing in both a and b. augur merge will de-duplicate while cat keeps both entries, so augur filter would not properly handle (and should error on) the duplicates in all.fasta.

A proper workaround would be using seqkit rmdup in place of cat:

augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
seqkit rmdup b.fasta a.fasta > c.fasta
augur filter --metadata c.tsv --sequences c.fasta --output-metadata merged.tsv --output-sequences merged.fasta

noting that the order of b.fasta a.fasta is necessary to keep consistent with augur merge behavior of coalescing right-to-left. The purpose of augur filter is then to remove any entries present in sequences that are missing from metadata (and vice versa).

The workaround is still not great since the metadata and sequence files are loosely coupled. augur merge could more tightly couple this by requiring the NAME=FILE input format. Take these hypothetical user errors and warnings as an example:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences b=b.fasta a=a.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta
# ERROR: Order of inputs differs between metadata (a,b) and sequences (b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b.fasta c=c.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta
# ERROR: Sequence file (c.fasta) does not have a corresponding metadata file.

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b.fasta \
  --output-metadata c.tsv
  --output-sequences c.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

@tsibley
Copy link
Member Author

tsibley commented Aug 31, 2024

Hmm. I'm not sure about these example errors. I know the original thinking was to pair metadata and sequences via the names given, but upon further reflection, I'm not sure we should require it.

# ERROR: Order of inputs differs between metadata (a,b) and sequences (b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.

Why require a matched order? That seems unfriendly to me, like asking the user to do unnecessary tedium for the computer's sake. I'd think to either

a. require named sequence inputs and make the order of --metadata be the authoritative one regardless of order in --sequences, or
b. don't require named sequence inputs and make the order of --sequences be the order to merge sequences, regardless of the order of --metadata. Names, if given, would allow for stricter warnings/validation of the metadata matching sequences.

Of the two, (b) would be better, I think.

If (a) and we want to support invocations without --metadata (i.e. --sequences only), then the order given to --sequences can matter in that specific case. For (b), this isn't an issue.

# ERROR: Sequence file (c.fasta) does not have a corresponding metadata file.

It seems to me to be reasonable/likely to have two or more paired metadata/sequence files, plus an extra file or two of sequences (e.g. corrected sequences) that don't have or need their own separate metadata. With strictly named pairs of metadata/sequence files, that wouldn't be possible and a stub metadata file would have to be fabricated for the sequences to avoid this error.

# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

These warnings seem good to me in general, but may need tweaking if behaviour (a) or (b) above is chosen instead.

@victorlin
Copy link
Member

Before debating requirement of a matched order we should first settle on whether to require named sequence inputs.

The only way to have these warnings

# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

is if the different inputs could be paired together (e.g. via named sequence inputs). Without this pairing, there is not much benefit to allowing --metadata and --sequences in the same command. I think at that point, it would be more clear if separate commands are used to signify that no pairing/validation is happening:

augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
augur merge --sequences b.fasta a.fasta --output-sequences c.fasta

How about (c) require named sequence inputs when used with metadata? I'll provide an example below.

It seems to me to be reasonable/likely to have two or more paired metadata/sequence files, plus an extra file or two of sequences (e.g. corrected sequences) that don't have or need their own separate metadata.

This is reasonable, thanks for providing the example. As an extension: suppose there are datasets A (a.tsv, a.fasta) and B (b.tsv, b.fasta, b_corrected.fasta). We could support all in one command:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a.fasta b.fasta b_corrected.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta

but that wouldn't do any validation between metadata/sequences. It might as well be separate commands:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --output-metadata merged.tsv
augur merge \
  --sequences a.fasta b.fasta b_corrected.fasta \
  --output-sequences merged.fasta

(c) would allow for this:

# Create single FASTA file for B
augur merge \
  --sequences b.fasta b_corrected.fasta \
  --output-sequences b_corrected_merged.fasta

# Merge with paired validation
augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b_corrected_merged.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b_corrected_merged.fasta is missing from b.csv. It will not be present in any output.

@tsibley
Copy link
Member Author

tsibley commented Sep 3, 2024

Some specific thoughts below, but the big picture is that I think we'll want to gather more insight into existing use cases and feedback from potential users on the behaviour that's most useful here.

The only way to have these warnings […] is if the different inputs could be paired together (e.g. via named sequence inputs).

Couldn't we have the warnings list all files if not paired by name?

# WARNING: Sequence `XXX` in a.tsv is missing from all sequence files.
# WARNING: Sequence `YYY` in b.fasta is missing from all metadata tables.

Or allow optional pairing by name, and multiple sequence inputs per name, to enable the warnings, but not require naming?

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b.fasta b=b_corrected.fasta cross_input_corrected.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta

Separately, when your example error says, "It will not be present in any output", does that mean missing sequences will filter out metadata records? I'm not sure that's what we want for augur merge.

It might as well be separate commands:

Except that as separate commands it'll take up more (potentially much more) transient disk space that's not otherwise required.

@victorlin
Copy link
Member

Couldn't we have the warnings list all files if not paired by name?
Or allow optional pairing by name, and multiple sequence inputs per name, to enable the warnings, but not require naming?

Sure, both of those seem reasonable.

Separately, when your example error says, "It will not be present in any output", does that mean missing sequences will filter out metadata records? I'm not sure that's what we want for augur merge.

Yes. That's how augur filter works currently, and I'm assuming we want to stay consistent:

10 strains were dropped during filtering
1 had no metadata
1 had no sequence data
1 was dropped because it was earlier than 2012.0 or missing a date
1 was dropped during grouping due to ambiguous month information
6 were dropped because of subsampling criteria, using seed 314159
3 strains passed all filters

as separate commands it'll take up more (potentially much more) transient disk space that's not otherwise required.

If augur merge --sequences a.fasta b.fasta is a wrapper for seqkit rmdup b.fasta a.fasta, I don't see any overhead with separate commands. But this could change with implementation.

the big picture is that I think we'll want to gather more insight into existing use cases and feedback from potential users on the behaviour that's most useful here.

👍 good discussion of possibilities so far. Some potential questions for feedback:

  • Is it beneficial to validate that an entry is present in both files of an input pair?
    Example: warn if an entry in a.tsv is missing from a.fasta

  • Is it beneficial to aggregate input types and validate that an entry is present in both types?
    Example: warn if an entry in a.tsv + b.tsv is missing from a.fasta + b.fasta

  • In this example, would it be beneficial or confusing if SEQ2 in c.tsv came from b.tsv while in c.fasta it came from a.fasta?

    # a: SEQ1, SEQ2
    # b:       SEQ2, SEQ3
    
    augur merge \
      --metadata a=a.tsv b=b.tsv \
      --sequences b=b.fasta a=a.fasta \
      --output-metadata c.tsv
      --output-sequences c.fasta

@tsibley
Copy link
Member Author

tsibley commented Sep 4, 2024

Yes. That's how augur filter works currently, and I'm assuming we want to stay consistent:

Yeah, I know that's how augur filter works. I guess I think of merging as something separate than a filter step here. It'd be weird IMO for data to be filtered out by augur merge.

If augur merge --sequences a.fasta b.fasta is a wrapper for seqkit rmdup b.fasta a.fasta, I don't see any overhead with separate commands. But this could change with implementation.

Ah, I meant in this example of yours:

# Create single FASTA file for B
augur merge \
  --sequences b.fasta b_corrected.fasta \
  --output-sequences b_corrected_merged.fasta

# Merge with paired validation
augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b_corrected_merged.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta

where b_corrected_merged.fasta is the extra disk space needed.

@victorlin
Copy link
Member

Yeah, I know that's how augur filter works. I guess I think of merging as something separate than a filter step here. It'd be weird IMO for data to be filtered out by augur merge.

That's a good point. My mind was stuck in augur filter land but it sounds reasonable to keep filtering out of augur merge. In that case, is there any point in having paired validation between input types? If not, the simplest approach would be to implement metadata and sequence merge support to be mutually exclusive. This would avoid the complexities that arise when allowing them to be used together such as which order to coalesce and the extent of paired validation.

In the future, if paired validation becomes necessary in workflows, we can consider adding support for both metadata+sequences in a single command as an additional feature.

@victorlin
Copy link
Member

To summarize, there are two different approaches:

  1. merge: Support sequences without cross-checking #1631
  2. merge: Support sequences with cross-checking #1601

(1) is much simpler to implement: the bulk of it is an alias to seqkit rmdup.

(2) needs work to first define how much cross-checking to do (this was discussed above). Depending on the amount of cross-checking, it may be necessary to:

  • split the metadata merging logic into database import / merge output
  • read the sequence IDs into a database table
  • add validation across metadata and sequence tables

The prototypes incorporate all of the above and should be functional enough to help decide which approach we want to take, at least initially.

@victorlin
Copy link
Member

victorlin commented Sep 25, 2024

Thinking through this again, we could allow all these scenarios in the same implementation:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences c.fasta b.fasta a.fasta
# WARNING: Sequence inputs are unnamed. Skipping validation between metadata and sequences.

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences c=c.fasta b=b.fasta a=a.fasta
# ERROR: Order of inputs differs between metadata (a,b) and sequences (c,b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.
# Alternatively, use unnamed sequence inputs to skip validation between
# metadata and sequences.

augur merge \
  --metadata a=a.tsv b=b.csv c=c.tsv \
  --sequences a=a.fasta b=b.fasta c=c.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

I'll gather feedback on this interface in Slack.

@victorlin
Copy link
Member

@jameshadfield posted a good reminder about segment-specific sequence files in nextstrain/zika#76 (comment). Copy/pasted below to continue discussion.


This seems clear-cut for unsegmented pathogens, whereby we have a single augur merge command which collects all n metadata TSVs & m sequence FASTAs into a single TSV & FASTA.

For segmented genomes our analyses use a single TSV and 1 FASTA per segment. So the metadata includes samples which may be in only one FASTA (e.g. only one segment was sequenced). AFAIK we haven't got prototypes for how to merge multiple inputs for segmented pathogens beyond my prototype in avian-flu which handles metadata merging and sequence merging independently. It's not clear what's the best approach here.

We could continue to treat things independently. When augur merge is given both sequences & metadata does it do more than drop rows?

Should augur merge be aware of segments? e.g. you could specify something like --sequences usvi=seg1=<path> usvi=seg2=<path>.

@victorlin
Copy link
Member

When augur merge is given both sequences & metadata does it do more than drop rows?

Yes, it will

I'm assuming "treat things independently" means something like this:

augur merge
  --metadata
      ingest=ingest.tsv
      usvi=usvi.tsv
  --sequences
      ingest=ingest-seg1.fasta
      usvi=usvi-seg1.fasta
  --output-metadata
      merged.tsv
  --output-sequences
      merged-seg1.fasta

augur merge
  --metadata
      ingest=ingest.tsv
      usvi=usvi.tsv
  --sequences
      ingest=ingest-seg2.fasta
      usvi=usvi-seg2.fasta
  --output-metadata
      merged.tsv
  --output-sequences
      merged-seg2.fasta

This should work just fine, but not optimally since the same metadata would be loaded into SQLite multiple times (and looks redundant to the user).

Should augur merge be aware of segments?

I think that's the only way to have a more optimal implementation. Instead of the usvi=seg1=<path> syntax, another option that keeps the existing name=file syntax would be to allow arbitrary --[output-]sequences-*:

augur merge
  --metadata
      ingest=ingest.tsv
      usvi=usvi.tsv
  --sequences-seg1
      ingest=ingest-seg1.fasta
      usvi=usvi-seg1.fasta
  --sequences-seg2
      ingest=ingest-seg2.fasta
      usvi=usvi-seg2.fasta
  --output-metadata
      merged.tsv
  --output-sequences-seg1
      merged-seg1.fasta
  --output-sequences-seg2
      merged-seg2.fasta

This retains the 1:many relationship between metadata and sequence files, but deviates from the 1:1 relationship in augur filter that's reflected in the current implementation. Maybe whatever we settle on here should be used in augur filter too?

@jameshadfield
Copy link
Member

I'm assuming "treat things independently" means something like this [...]

Not quite. As per the linked example I mean something like:

augur merge --metadata A=A.tsv B=B.tsv --output-metadata merged.tsv

seqkit rmdup A.seg1.fasta B.seg1.fasta > merged.seg1.fasta

seqkit rmdup A.seg2.fasta B.seg2.fasta > merged.seg2.fasta

In your first example (with two augur merge commands), if there is a strain which is in seg1 but not seg2 I would have thought this is dropped from the resulting merged.tsv. Testing 81b22d7 shows this is not the case. I didn't expect this, even if it is convenient in this case.

Apart from the (nice) sanity checking of command line arguments, I'm not quite understanding the difference between a joint merge of metadata + sequences vs independent merging. I'll make a reminder to circle back here in the new year and revisit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants