Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing unique k-mers between samples #2383

Open
jessicalumian opened this issue Nov 30, 2022 · 26 comments
Open

Removing unique k-mers between samples #2383

jessicalumian opened this issue Nov 30, 2022 · 26 comments
Labels
code herein lies code example

Comments

@jessicalumian
Copy link

Hello! I am working on replicating some (amazing) work from Taylor's IBD project.

I have signatures from sourmash sketch dna of metagenome samples. I would like to remove k-mers that are only present in 1 sample with the hope this will reduce noise in comparison between samples.

It looks like Taylor does this here.

Would it be recommended to phagocytize this code, or write something else? Thank you!

@ctb
Copy link
Contributor

ctb commented Nov 30, 2022

That code looks good to me!

Note that the output is a text file with just the relevant hashes in it (and not their counts); if you need a different format, e.g. something suitable for use with sourmash sig inflate, lmk.

@jessicalumian
Copy link
Author

Ok cool! Because this is dealing with hashes, does that mean that k-mer abundance information is retained? Or would that require sourmash sig inflate?

I'm not totally clear on how inflate works. Is this correct:

Each sample has its own signature file. sample_A.sig has a hash is specific to the abundance of a k-mer. If sample_B.sig has that same hash, then sourmash sig inflate would represent that hash in the signature of sample B?

@ctb
Copy link
Contributor

ctb commented Nov 30, 2022

Let me see if this response helps - ask away if not :)

when signatures are calculated with sketch dna -p abund the signature stores both a hash (representing a k-mer) and an abundance (representing the multiplicity of that k-mer). Some operations can be done only on signatures with abundances, others can be done only on signatures without abundances, and some operations can be done on both.

sourmash sig flatten removes abundances.

sourmash sig inflate transfers abundances from one signature onto the hashes in another, where those hashes are in common; hashes that are not present in the intersection of the abund signature and the flat signature, they are ignored.

An exercise you might try - sourmash sig flatten a.sig -o b.sig then sourmash sig inflate a.sig b.sig -o c.sig should produce a c.sig file that is equivalent to a.sig. Although... it's actually kind of hard to track and diagnose abundances and so on. Let us know what would help you debug things, as you proceed!

what I did to verify that a.sig and c.sig are identical -

I ran sourmash search a.sig b.sig and sourmash search a.sig c.sig. The a-vs-b failed because b didn't have abundance, the a-vs-c suceeded with a 100% similarity.

other tricks

sourmash sig describe will tell you if a signature has abundances.

@ctb
Copy link
Contributor

ctb commented Nov 30, 2022

it's not hard to change the code in @taylorreiter Snakefile to output an abundance sig. I'll do that when I get a chance!

@ctb
Copy link
Contributor

ctb commented Nov 30, 2022

Check this 'un out -

https://github.com/ctb/2022-sourmash-filter-min-samples

@ctb
Copy link
Contributor

ctb commented Nov 30, 2022

(some things to be done to improve, but it's a start!)

@jessicalumian
Copy link
Author

That explanation helps a lot!! I'm trying to picture a use case for inflate and flatten. The goal to be able to add or take away abundance information to allow you to do sourmash operations that are dependent on sigs being inflated or flat? Is there every a use case where you would inflate the abundance of one sample's signature onto a separate sample's flat signature? (I can't think of a reason for this, and this was my mental block before your explanation.)

Thank you for the python code!!! sneaking in a snakemake question: the best practice would be to have this script in the directory with my snakefile and call it in a rule as opposed to putting this code directly in the snakefile, right?

@ctb
Copy link
Contributor

ctb commented Dec 1, 2022

That explanation helps a lot!! I'm trying to picture a use case for inflate and flatten. The goal to be able to add or take away abundance information to allow you to do sourmash operations that are dependent on sigs being inflated or flat? Is there every a use case where you would inflate the abundance of one sample's signature onto a separate sample's flat signature? (I can't think of a reason for this, and this was my mental block before your explanation.)

flatten is important for situations where you need to or want to get rid of abundances - there are various other sourmash commands that require flat signatures. Like, you have to run gather against genomes, which have to be flat. This in turn is so that we can have error messages that say "what you're doing doesn't make sense/we don't have any way to do that" as a way to guide the user that Here Be Dragons.

inflate has come up mostly in situations like the one you're facing here -

  • I select hashes based on some weird criterion like "all hashes that appear in more than one file", in which it doesn't make sense to keep abundances.
  • so I calculate that, and then intersect all my individual signatures with those hashes.
  • but now I want the abundances back for the hashes that I kept!! please help!

Basically there are weird things power users sometimes want to do and we're trying to enable them in a sensible way where common things are actually available via the sourmash command line, and thus power users only need to do very oddball things via scripting.

tl;dr no one clear use case, but many fuzzy ones, so we supported it :)

Thank you for the python code!!! sneaking in a snakemake question: the best practice would be to have this script in the directory with my snakefile and call it in a rule as opposed to putting this code directly in the snakefile, right?

yep! it's not always clear it matters but if it starts as a separate script then might as well keep it that way!

@ctb ctb added code herein lies code example labels Dec 1, 2022
@ctb
Copy link
Contributor

ctb commented Dec 1, 2022

Note to self: sourmash sig export and import per #1098 would help demonstrate/check/evaluate abundance stuff in signatures.

@ctb
Copy link
Contributor

ctb commented Dec 1, 2022

(Updated the git repo with comments and README.)

@ctb
Copy link
Contributor

ctb commented Dec 1, 2022

Thinking of integrating this into sourmash sig as either commonhash or occurrences - any thoughts?

@taylorreiter
Copy link
Contributor

I love the idea of integrating this into sourmash sig! I saw commonhash and liked it -- i think as long as it's documented then occurrences is also fine!

@jessicalumian -- I used the code you linked above to create a signature that had all of the hashes that I was interested in keeping. Then, I took each of my metagenome signatures and used sourmash sig intersect to only keep the hashes of interest. There are definitely a bunch of different ways to achieve the same thing! Just wanted to add a note about the approach i took.

@jessicalumian
Copy link
Author

I ran the code from Titus and I have 1 giant signature of everything I want to keep from all my samples (woo!). Because I'm never out of questions:

If I use sourmash sig intersect I see that abundances will be lost. But if I use --abundance-from, the abundances will be transferred from my 1 giant signature to the individual signatures that are made. Is this correct? (Or, I could use sourmash sig inflate to transfer abundances.)

Does it make sense to retain abundances for samples that have unique k-mers removed? It would be nice to see if microbe X is highly abundant in one sample and low in another. It seems like the removal of unique k-mers wouldn't affect this comparison because abundance was calculated from the original metagenome, right?

@ctb
Copy link
Contributor

ctb commented Dec 1, 2022

so I was worried this would happen 😆

you should actually have one gigantic file containing many signatures! Try running sourmash sig summarize <filename>.

You can use sourmash sig split to split those out into individual sig files, or output them directly that way using a directory output, e.g. -o foo/. See output formats for more info!

@jessicalumian
Copy link
Author

ooooh ok I might actually have 1 file with multiple signatures! This is the output of sourmash sig summarize

** loading from 'kmers_21_in_multiple_samples_with_abundance.sig'
path filetype: MultiIndex
location: kmers_21_in_multiple_samples_with_abundance.sig
is database? no
has manifest? yes
num signatures: 20
** examining manifest...
total hashes: 1243224
summary of sketches:
   20 sketches with DNA, k=21, scaled=2000, abund     1243224 total hashes

This would indicate that I actually have 1 file containing my 20 signatures (I had 20 input files in this case), right? After running sourmash sig split I now have 20 signature files, which are the signatures from my initial samples with unique k-mers removed.

The names have a hash and some stuff (like the example here), but I can rename them with sourmash sig rename.

Sourmash: There's a command for that. 💪 ✨

@taylorreiter
Copy link
Contributor

(this is so awesome)

@ccbaumler
Copy link
Contributor

For clarity, this workflow is:

sourmash sketch dna -p abund *.fq.gz followed by ./filter-min-samples.py *.sig -o filtered.zip or ./filter-min-samples.py *.sig -o ./output/ (from this repo). This will output a collection of signatures of only commonly shared hashes across the intersect of all individual signatures included while retaining the abundances stored in each individual signature.

Correct?

@jessicalumian
Copy link
Author

jessicalumian commented Dec 2, 2022

yep! then sourmash sig split to split into separate signature files, then renaming sig files if desired. so:

  1. create signatures from metagenomes with abundance tracking sourmash sketch dna -p abund *.fq.gz
  2. remove k-mers only present in 1 sample with filter-min-samples.py to create frakensignature file
  3. split up frakensignature file into individual sig files with sourmash sig split {input} -gz -o outdir (I haven't figured out the smart way to do this in a snakefile yet because the output names are hashes and other info (example), do for now I defined my output as the directory that will be created

@ctb
Copy link
Contributor

ctb commented Dec 2, 2022

(very evocative names...)

you can actually do a lot of stuff with the frakensignature file, which is pretty snakemake friendly. Question is what you want to actually do :).

note that the sketches should still be named by whatever their sample name is; see sourmash sig describe <frakensignature.zip>. Then sourmash sig cat --include <name> -o <name>.sig.gz will split out the names etc.

(Basically, I try to not using named .sig files if I can avoid it, because the .zip is so much more convenient. But it might require retooling some of the specific workflow steps. not sure.)

@ccbaumler
Copy link
Contributor

ccbaumler commented Dec 14, 2022

Wild idea. Thinking about Branchwater in relation to this technique. We have attempted to extract metagenome sequences from the SRA with random forest classified signatures. That was not as informative as we had hoped with that particular search.

Could we instead identify a set of metagenomes of a particular phenotype (e.g. IBD) and isolate the shared common kmers of that phenotype? Subtracting a control phenotype common kmer set from the case phenotype signature either before or after creating the case common kmer sig.

Edit: Or, in addition to subtracting control phenotype common hash sets, we could subtract unique cases from each ohter as well.
Edit2: I suppose you could call this identifying the core genomic components of phenotypic metagenome data

@taylorreiter
Copy link
Contributor

So you're suggesting:

  1. Find IBD metagenomes
  2. Find other gut metagenomes from non-ibd data sets
  3. from the IBD metagenomes, subtract hashes that are in signatures of the other gut metagenomes
  4. do stuff with the leftovers

@ccbaumler
Copy link
Contributor

ccbaumler commented Dec 14, 2022

4a. branchwater search using the core genomic components of each common hash set and compare the results

@ctb
Copy link
Contributor

ctb commented Dec 16, 2022

(I'm tempted to suggest we move this to a new issue, since it's diverging pretty far from the issue title - but we can do that later I guess ;)

I would be very interested also/separately in calculating Shannon entropy / information with respect to metadata. I have code 🤷

@ctb
Copy link
Contributor

ctb commented Jan 6, 2023

Trying this out with the CLI plugin infrastructure #2438 - see PR ctb/2022-sourmash-filter-min-samples#1.

Kinda neat - when all the machinery works, you get the ability to run:

% sourmash scripts commonhash ../sourmash/podar-ref/*.sig  -o xxx.sig

@ctb
Copy link
Contributor

ctb commented Jul 31, 2023

The code has now been moved from https://github.com/ctb/2022-sourmash-filter-min-samples to https://github.com/ctb/sourmash_plugin_commonhash.

Leaving this issue open because it has a lot of good discussion that we should put in advanced documentation or something.

@mr-eyes
Copy link
Member

mr-eyes commented Feb 2, 2024

I think this fits here.
https://github.com/mr-eyes/sourmash_plugin_extract_errors

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
code herein lies code example
Projects
None yet
Development

No branches or pull requests

5 participants