Removing unique k-mers between samples #2383
Comments
That code looks good to me! Note that the output is a text file with just the relevant hashes in it (and not their counts); if you need a different format, e.g. something suitable for use with |
Ok cool! Because this is dealing with hashes, does that mean that k-mer abundance information is retained? Or would that require I'm not totally clear on how Each sample has its own signature file. sample_A.sig has a hash is specific to the abundance of a k-mer. If sample_B.sig has that same hash, then |
Let me see if this response helps - ask away if not :) when signatures are calculated with
An exercise you might try - what I did to verify that
it's not hard to change the code in @taylorreiter Snakefile to output an abundance sig. I'll do that when I get a chance! |
Check this 'un out - |
(some things to be done to improve, but it's a start!) |
That explanation helps a lot!! I'm trying to picture a use case for Thank you for the python code!!! sneaking in a snakemake question: the best practice would be to have this script in the directory with my snakefile and call it in a rule as opposed to putting this code directly in the snakefile, right? |
Basically there are weird things power users sometimes want to do and we're trying to enable them in a sensible way where common things are actually available via the sourmash command line, and thus power users only need to do very oddball things via scripting. tl;dr no one clear use case, but many fuzzy ones, so we supported it :)
yep! it's not always clear it matters but if it starts as a separate script then might as well keep it that way! |
Note to self: |
(Updated the git repo with comments and README.) |
Thinking of integrating this into |
I love the idea of integrating this into @jessicalumian -- I used the code you linked above to create a signature that had all of the hashes that I was interested in keeping. Then, I took each of my metagenome signatures and used |
I ran the code from Titus and I have 1 giant signature of everything I want to keep from all my samples (woo!). Because I'm never out of questions: If I use Does it make sense to retain abundances for samples that have unique k-mers removed? It would be nice to see if microbe X is highly abundant in one sample and low in another. It seems like the removal of unique k-mers wouldn't affect this comparison because abundance was calculated from the original metagenome, right? |
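On the abundance question above, here is a minimal sketch of why filtering hashes need not disturb abundances. It assumes each sample's abundance-tracking sketch can be represented as a plain dict mapping hash to count; `filter_abundances` and the toy values are hypothetical and are not part of sourmash.

```python
def filter_abundances(abund, keep):
    """Keep only the hashes in `keep`; abundance values are copied unchanged."""
    return {h: n for h, n in abund.items() if h in keep}


# Hypothetical hash -> abundance mappings standing in for two metagenome sketches.
sample_A = {10: 7, 11: 1, 12: 30}
sample_B = {10: 2, 12: 1, 13: 5}

# Keep only hashes seen in both samples (hashes 11 and 13 are unique to one sample).
keep = set(sample_A) & set(sample_B)

filtered_A = filter_abundances(sample_A, keep)
filtered_B = filter_abundances(sample_B, keep)

print(filtered_A)
print(filtered_B)
```

Under this assumption the surviving hashes carry the counts computed from the original metagenome, so a hash that is abundant in one sample and rare in another stays that way after filtering.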
so I was worried this would happen 😆 you should actually have one gigantic file containing many signatures! Try running You can use |
ooooh ok I might actually have 1 file with multiple signatures! This is the output of
This would indicate that I actually have 1 file containing my 20 signatures (I had 20 input files in this case), right? After running The names have a hash and some stuff (like the example here), but I can rename them with Sourmash: There's a command for that. 💪 ✨ |
(this is so awesome) |
For clarity, this workflow is:
Correct? |
yep! then
(very evocative names...) you can actually do a lot of stuff with the frakensignature file, which is pretty snakemake friendly. Question is what you want to actually do :). Note that the sketches should still be named by whatever their sample name is; see (Basically, I try not to use named .sig files if I can avoid it, because the .zip is so much more convenient. But it might require retooling some of the specific workflow steps. Not sure.) |
Wild idea. Thinking about Branchwater in relation to this technique. We have attempted to extract metagenome sequences from the SRA with random forest classified signatures. That was not as informative as we had hoped with that particular search. Could we instead identify a set of metagenomes of a particular phenotype (e.g. IBD) and isolate the shared common k-mers of that phenotype? We could subtract a control phenotype's common k-mer set from the case phenotype signature, either before or after creating the case common k-mer sig. Edit: Or, in addition to subtracting control phenotype common hash sets, we could subtract unique cases from each other as well. |
So you're suggesting:
4a. branchwater search using the core genomic components of each common hash set and compare the results |
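The case-minus-control idea discussed above can be sketched with plain set arithmetic. This is only an illustration under stated assumptions: each sample is a set of integer hashes, "common" means present in at least `min_samples` samples of a group, and `common_hashes` is a hypothetical helper, not a sourmash or Branchwater function.

```python
from collections import Counter


def common_hashes(samples, min_samples):
    """Return hashes present in at least `min_samples` of the given samples."""
    counts = Counter()
    for hashes in samples:
        # set() ensures each hash is counted once per sample
        counts.update(set(hashes))
    return {h for h, n in counts.items() if n >= min_samples}


# Toy hash sets for a case phenotype (e.g. IBD) and a control phenotype.
cases = [{1, 2, 3, 9}, {1, 2, 4, 9}, {1, 2, 5}]
controls = [{1, 7, 8, 9}, {1, 7, 9}, {1, 6, 9}]

case_common = common_hashes(cases, min_samples=2)
control_common = common_hashes(controls, min_samples=2)

# Hashes common among cases but not common among controls.
case_specific = case_common - control_common

print(sorted(case_common))
print(sorted(control_common))
print(sorted(case_specific))
```

The resulting `case_specific` set is the kind of hash set that could then be turned into a query signature; how that query is built and searched is left out here.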
(I'm tempted to suggest we move this to a new issue, since it's diverging pretty far from the issue title - but we can do that later I guess ;) I would be very interested also/separately in calculating Shannon entropy / information with respect to metadata. I have code 🤷 |
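One way to read "Shannon entropy with respect to metadata" is, for each hash, the entropy of the metadata labels of the samples that contain it. The sketch below shows that reading only; it is not the code mentioned above, and `label_entropy`, `hash_label_entropy`, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy, in bits, of a list of metadata labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def hash_label_entropy(samples, metadata):
    """For each hash, entropy of the labels of the samples that contain it."""
    labels_by_hash = {}
    for name, hashes in samples.items():
        for h in set(hashes):
            labels_by_hash.setdefault(h, []).append(metadata[name])
    return {h: label_entropy(labels) for h, labels in labels_by_hash.items()}


# Toy data: hash 1 is in every sample, hash 2 only in cases, hash 3 only in controls.
samples = {"s1": {1, 2}, "s2": {1, 2}, "s3": {1, 3}, "s4": {1, 3}}
metadata = {"s1": "IBD", "s2": "IBD", "s3": "control", "s4": "control"}

entropy = hash_label_entropy(samples, metadata)
```

In this reading, a hash found evenly across both phenotypes has high entropy (uninformative about the label), while a hash confined to one phenotype has entropy zero.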
Trying this out with the CLI plugin infrastructure #2438 - see PR ctb/2022-sourmash-filter-min-samples#1. Kinda neat - when all the machinery works, you get the ability to run:
The code has now been moved from https://github.com/ctb/2022-sourmash-filter-min-samples to https://github.com/ctb/sourmash_plugin_commonhash. Leaving this issue open because it has a lot of good discussion that we should put in advanced documentation or something. |
I think this fits here. |
Hello! I am working on replicating some (amazing) work from Taylor's IBD project.
I have signatures from sourmash sketch dna of metagenome samples. I would like to remove k-mers that are only present in 1 sample, with the hope that this will reduce noise in comparisons between samples. It looks like Taylor does this here.
Would it be recommended to phagocytize this code, or write something else? Thank you!
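The filtering asked about here can be sketched in a few lines of plain Python. This is a minimal illustration, assuming each sample's sketch has already been reduced to a set of integer hashes; the function names and toy data are hypothetical, it is not Taylor's Snakefile code, and loading hashes from real .sig files is omitted.

```python
from collections import Counter


def hashes_in_multiple_samples(samples, min_samples=2):
    """Return the set of hashes present in at least `min_samples` samples.

    `samples` maps a sample name to an iterable of integer hashes.
    """
    counts = Counter()
    for hashes in samples.values():
        # set() ensures each hash is counted once per sample
        counts.update(set(hashes))
    return {h for h, n in counts.items() if n >= min_samples}


def filter_samples(samples, keep):
    """Restrict every sample to the hashes in `keep`."""
    return {name: set(hashes) & keep for name, hashes in samples.items()}


# Toy data: hashes 1, 2, 5 and 6 each occur in only one sample.
samples = {
    "sample_A": {1, 2, 3, 4},
    "sample_B": {3, 4, 5},
    "sample_C": {4, 6},
}

keep = hashes_in_multiple_samples(samples)
filtered = filter_samples(samples, keep)

print(sorted(keep))
for name, hashes in filtered.items():
    print(name, sorted(hashes))
```

Raising `min_samples` tightens the filter, for example to keep only hashes seen in three or more samples.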