Removing unique k-mers between samples #2383

jessicalumian · 2022-11-30T20:59:22Z

Hello! I am working on replicating some (amazing) work from Taylor's IBD project.

I have signatures from sourmash sketch dna of metagenome samples. I would like to remove k-mers that are only present in 1 sample with the hope this will reduce noise in comparison between samples.

It looks like Taylor does this here.

Would it be recommended to phagocytize this code, or write something else? Thank you!

The text was updated successfully, but these errors were encountered:

ctb · 2022-11-30T21:10:12Z

That code looks good to me!

Note that the output is a text file with just the relevant hashes in it (and not their counts); if you need a different format, e.g. something suitable for use with sourmash sig inflate, lmk.

jessicalumian · 2022-11-30T21:24:09Z

Ok cool! Because this is dealing with hashes, does that mean that k-mer abundance information is retained? Or would that require sourmash sig inflate?

I'm not totally clear on how inflate works. Is this correct:

Each sample has its own signature file. sample_A.sig has a hash is specific to the abundance of a k-mer. If sample_B.sig has that same hash, then sourmash sig inflate would represent that hash in the signature of sample B?

ctb · 2022-11-30T22:04:21Z

Let me see if this response helps - ask away if not :)

when signatures are calculated with sketch dna -p abund the signature stores both a hash (representing a k-mer) and an abundance (representing the multiplicity of that k-mer). Some operations can be done only on signatures with abundances, others can be done only on signatures without abundances, and some operations can be done on both.

sourmash sig flatten removes abundances.

sourmash sig inflate transfers abundances from one signature onto the hashes in another, where those hashes are in common; hashes that are not present in the intersection of the abund signature and the flat signature, they are ignored.

An exercise you might try - sourmash sig flatten a.sig -o b.sig then sourmash sig inflate a.sig b.sig -o c.sig should produce a c.sig file that is equivalent to a.sig. Although... it's actually kind of hard to track and diagnose abundances and so on. Let us know what would help you debug things, as you proceed!

what I did to verify that `a.sig` and `c.sig` are identical -

I ran sourmash search a.sig b.sig and sourmash search a.sig c.sig. The a-vs-b failed because b didn't have abundance, the a-vs-c suceeded with a 100% similarity.

other tricks

sourmash sig describe will tell you if a signature has abundances.

ctb · 2022-11-30T22:05:03Z

it's not hard to change the code in @taylorreiter Snakefile to output an abundance sig. I'll do that when I get a chance!

ctb · 2022-11-30T22:25:06Z

Check this 'un out -

https://github.com/ctb/2022-sourmash-filter-min-samples

ctb · 2022-11-30T22:25:21Z

(some things to be done to improve, but it's a start!)

jessicalumian · 2022-12-01T00:30:24Z

That explanation helps a lot!! I'm trying to picture a use case for inflate and flatten. The goal to be able to add or take away abundance information to allow you to do sourmash operations that are dependent on sigs being inflated or flat? Is there every a use case where you would inflate the abundance of one sample's signature onto a separate sample's flat signature? (I can't think of a reason for this, and this was my mental block before your explanation.)

Thank you for the python code!!! sneaking in a snakemake question: the best practice would be to have this script in the directory with my snakefile and call it in a rule as opposed to putting this code directly in the snakefile, right?

ctb · 2022-12-01T00:38:07Z

That explanation helps a lot!! I'm trying to picture a use case for inflate and flatten. The goal to be able to add or take away abundance information to allow you to do sourmash operations that are dependent on sigs being inflated or flat? Is there every a use case where you would inflate the abundance of one sample's signature onto a separate sample's flat signature? (I can't think of a reason for this, and this was my mental block before your explanation.)

flatten is important for situations where you need to or want to get rid of abundances - there are various other sourmash commands that require flat signatures. Like, you have to run gather against genomes, which have to be flat. This in turn is so that we can have error messages that say "what you're doing doesn't make sense/we don't have any way to do that" as a way to guide the user that Here Be Dragons.

inflate has come up mostly in situations like the one you're facing here -

I select hashes based on some weird criterion like "all hashes that appear in more than one file", in which it doesn't make sense to keep abundances.
so I calculate that, and then intersect all my individual signatures with those hashes.
but now I want the abundances back for the hashes that I kept!! please help!

Basically there are weird things power users sometimes want to do and we're trying to enable them in a sensible way where common things are actually available via the sourmash command line, and thus power users only need to do very oddball things via scripting.

tl;dr no one clear use case, but many fuzzy ones, so we supported it :)

Thank you for the python code!!! sneaking in a snakemake question: the best practice would be to have this script in the directory with my snakefile and call it in a rule as opposed to putting this code directly in the snakefile, right?

yep! it's not always clear it matters but if it starts as a separate script then might as well keep it that way!

ctb · 2022-12-01T14:53:19Z

Note to self: sourmash sig export and import per #1098 would help demonstrate/check/evaluate abundance stuff in signatures.

ctb · 2022-12-01T15:03:13Z

(Updated the git repo with comments and README.)

ctb · 2022-12-01T15:08:53Z

Thinking of integrating this into sourmash sig as either commonhash or occurrences - any thoughts?

taylorreiter · 2022-12-01T15:23:47Z

I love the idea of integrating this into sourmash sig! I saw commonhash and liked it -- i think as long as it's documented then occurrences is also fine!

@jessicalumian -- I used the code you linked above to create a signature that had all of the hashes that I was interested in keeping. Then, I took each of my metagenome signatures and used sourmash sig intersect to only keep the hashes of interest. There are definitely a bunch of different ways to achieve the same thing! Just wanted to add a note about the approach i took.

jessicalumian · 2022-12-01T18:37:18Z

I ran the code from Titus and I have 1 giant signature of everything I want to keep from all my samples (woo!). Because I'm never out of questions:

If I use sourmash sig intersect I see that abundances will be lost. But if I use --abundance-from, the abundances will be transferred from my 1 giant signature to the individual signatures that are made. Is this correct? (Or, I could use sourmash sig inflate to transfer abundances.)

Does it make sense to retain abundances for samples that have unique k-mers removed? It would be nice to see if microbe X is highly abundant in one sample and low in another. It seems like the removal of unique k-mers wouldn't affect this comparison because abundance was calculated from the original metagenome, right?

ctb · 2022-12-01T18:41:27Z

so I was worried this would happen 😆

you should actually have one gigantic file containing many signatures! Try running sourmash sig summarize <filename>.

You can use sourmash sig split to split those out into individual sig files, or output them directly that way using a directory output, e.g. -o foo/. See output formats for more info!

jessicalumian · 2022-12-01T18:57:27Z

ooooh ok I might actually have 1 file with multiple signatures! This is the output of sourmash sig summarize

** loading from 'kmers_21_in_multiple_samples_with_abundance.sig'
path filetype: MultiIndex
location: kmers_21_in_multiple_samples_with_abundance.sig
is database? no
has manifest? yes
num signatures: 20
** examining manifest...
total hashes: 1243224
summary of sketches:
   20 sketches with DNA, k=21, scaled=2000, abund     1243224 total hashes

This would indicate that I actually have 1 file containing my 20 signatures (I had 20 input files in this case), right? After running sourmash sig split I now have 20 signature files, which are the signatures from my initial samples with unique k-mers removed.

The names have a hash and some stuff (like the example here), but I can rename them with sourmash sig rename.

Sourmash: There's a command for that. 💪 ✨

taylorreiter · 2022-12-01T19:02:58Z

(this is so awesome)

ccbaumler · 2022-12-01T22:52:40Z

For clarity, this workflow is:

sourmash sketch dna -p abund *.fq.gz followed by ./filter-min-samples.py *.sig -o filtered.zip or ./filter-min-samples.py *.sig -o ./output/ (from this repo). This will output a collection of signatures of only commonly shared hashes across the intersect of all individual signatures included while retaining the abundances stored in each individual signature.

Correct?

jessicalumian · 2022-12-02T01:29:51Z

yep! then sourmash sig split to split into separate signature files, then renaming sig files if desired. so:

create signatures from metagenomes with abundance tracking sourmash sketch dna -p abund *.fq.gz
remove k-mers only present in 1 sample with filter-min-samples.py to create frakensignature file
split up frakensignature file into individual sig files with sourmash sig split {input} -gz -o outdir (I haven't figured out the smart way to do this in a snakefile yet because the output names are hashes and other info (example), do for now I defined my output as the directory that will be created

ctb · 2022-12-02T01:35:06Z

(very evocative names...)

you can actually do a lot of stuff with the frakensignature file, which is pretty snakemake friendly. Question is what you want to actually do :).

note that the sketches should still be named by whatever their sample name is; see sourmash sig describe <frakensignature.zip>. Then sourmash sig cat --include <name> -o <name>.sig.gz will split out the names etc.

(Basically, I try to not using named .sig files if I can avoid it, because the .zip is so much more convenient. But it might require retooling some of the specific workflow steps. not sure.)

ccbaumler · 2022-12-14T19:37:56Z

Wild idea. Thinking about Branchwater in relation to this technique. We have attempted to extract metagenome sequences from the SRA with random forest classified signatures. That was not as informative as we had hoped with that particular search.

Could we instead identify a set of metagenomes of a particular phenotype (e.g. IBD) and isolate the shared common kmers of that phenotype? Subtracting a control phenotype common kmer set from the case phenotype signature either before or after creating the case common kmer sig.

Edit: Or, in addition to subtracting control phenotype common hash sets, we could subtract unique cases from each ohter as well.
Edit2: I suppose you could call this identifying the core genomic components of phenotypic metagenome data

taylorreiter · 2022-12-14T20:05:46Z

So you're suggesting:

Find IBD metagenomes
Find other gut metagenomes from non-ibd data sets
from the IBD metagenomes, subtract hashes that are in signatures of the other gut metagenomes
do stuff with the leftovers

ccbaumler · 2022-12-14T20:13:23Z

4a. branchwater search using the core genomic components of each common hash set and compare the results

ctb · 2022-12-16T14:54:36Z

(I'm tempted to suggest we move this to a new issue, since it's diverging pretty far from the issue title - but we can do that later I guess ;)

I would be very interested also/separately in calculating Shannon entropy / information with respect to metadata. I have code 🤷

ctb · 2023-01-06T19:36:23Z

Trying this out with the CLI plugin infrastructure #2438 - see PR ctb/2022-sourmash-filter-min-samples#1.

Kinda neat - when all the machinery works, you get the ability to run:

% sourmash scripts commonhash ../sourmash/podar-ref/*.sig  -o xxx.sig

ctb · 2023-07-31T10:57:02Z

The code has now been moved from https://github.com/ctb/2022-sourmash-filter-min-samples to https://github.com/ctb/sourmash_plugin_commonhash.

Leaving this issue open because it has a lot of good discussion that we should put in advanced documentation or something.

mr-eyes · 2024-02-02T03:56:13Z

I think this fits here.
https://github.com/mr-eyes/sourmash_plugin_extract_errors

ctb added code herein lies code example labels Dec 1, 2022

ctb mentioned this issue Mar 1, 2023

Group-specific hashes from sourmash signature #2495

Closed

ctb mentioned this issue Sep 23, 2023

add functionality to sourmash signature to extract common hashes? #603

Open

ctb mentioned this issue Feb 5, 2024

plugin: sourmash_plugin_extract_errors #2976

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Removing unique k-mers between samples #2383

Removing unique k-mers between samples #2383

jessicalumian commented Nov 30, 2022

ctb commented Nov 30, 2022

jessicalumian commented Nov 30, 2022

ctb commented Nov 30, 2022

ctb commented Nov 30, 2022

ctb commented Nov 30, 2022

ctb commented Nov 30, 2022

jessicalumian commented Dec 1, 2022

ctb commented Dec 1, 2022 •

edited

Loading

ctb commented Dec 1, 2022

ctb commented Dec 1, 2022

ctb commented Dec 1, 2022

taylorreiter commented Dec 1, 2022

jessicalumian commented Dec 1, 2022

ctb commented Dec 1, 2022

jessicalumian commented Dec 1, 2022

taylorreiter commented Dec 1, 2022

ccbaumler commented Dec 1, 2022

jessicalumian commented Dec 2, 2022 •

edited

Loading

ctb commented Dec 2, 2022

ccbaumler commented Dec 14, 2022 •

edited

Loading

taylorreiter commented Dec 14, 2022

ccbaumler commented Dec 14, 2022 •

edited

Loading

ctb commented Dec 16, 2022

ctb commented Jan 6, 2023

ctb commented Jul 31, 2023

mr-eyes commented Feb 2, 2024

Removing unique k-mers between samples #2383

Removing unique k-mers between samples #2383

Comments

jessicalumian commented Nov 30, 2022

ctb commented Nov 30, 2022

jessicalumian commented Nov 30, 2022

ctb commented Nov 30, 2022

what I did to verify that a.sig and c.sig are identical -

other tricks

ctb commented Nov 30, 2022

ctb commented Nov 30, 2022

ctb commented Nov 30, 2022

jessicalumian commented Dec 1, 2022

ctb commented Dec 1, 2022 • edited Loading

ctb commented Dec 1, 2022

ctb commented Dec 1, 2022

ctb commented Dec 1, 2022

taylorreiter commented Dec 1, 2022

jessicalumian commented Dec 1, 2022

ctb commented Dec 1, 2022

jessicalumian commented Dec 1, 2022

taylorreiter commented Dec 1, 2022

ccbaumler commented Dec 1, 2022

jessicalumian commented Dec 2, 2022 • edited Loading

ctb commented Dec 2, 2022

ccbaumler commented Dec 14, 2022 • edited Loading

taylorreiter commented Dec 14, 2022

ccbaumler commented Dec 14, 2022 • edited Loading

ctb commented Dec 16, 2022

ctb commented Jan 6, 2023

ctb commented Jul 31, 2023

mr-eyes commented Feb 2, 2024

what I did to verify that `a.sig` and `c.sig` are identical -

ctb commented Dec 1, 2022 •

edited

Loading

jessicalumian commented Dec 2, 2022 •

edited

Loading

ccbaumler commented Dec 14, 2022 •

edited

Loading

ccbaumler commented Dec 14, 2022 •

edited

Loading