MRG: enable singleton sketching, facilitate reads-based sketching #184
bluegenes commented Jan 9, 2024
bluegenes changed the title from "WIP: manysketch improvements" to "MRG: enable singleton sketching, facilitate reads-based sketching" on Feb 27, 2024
@ctb ready for review.
ctb reviewed Feb 27, 2024
ctb approved these changes Feb 27, 2024
Looks great, works great!
Some minimal documentation and an update to the help docstring would be lovely :). But punt to an issue if you prefer to add those later.
bluegenes added a commit that referenced this pull request Mar 1, 2024
## Prefix-based sketching

#184 introduces a new input type that better supports metagenome reads, but doesn't really make things that much simpler for the power user. We can probably support prefix-style naming, as suggested in dib-lab/sourmash-slainte#11.

Here we introduce a 'prefix' CSV type with the following columns: `name,input_moltype,prefix,exclude`.

Here we:
1. glob to find all files that match `prefix`
2. glob to find all files that match `exclude`
3. filter `prefix` files to exclude `exclude` files

This just uses `glob`, no `regex`, so `*` are fine in `prefix` and `exclude`, but not full regex patterns.
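The three glob steps above can be sketched roughly as follows. This is only an illustration in Python, not the code from the commit: the function name `resolve_prefix_files` is hypothetical, and it assumes the `prefix` and `exclude` values are used directly as glob patterns (so any `*` has to be written in the CSV itself).

```python
import glob


def resolve_prefix_files(prefix, exclude=None):
    """Return sorted files matching `prefix`, minus any that match `exclude`.

    Hypothetical helper: both arguments are treated as plain glob patterns.
    """
    # Step 1: glob to find all files that match the prefix pattern.
    matched = set(glob.glob(prefix))
    # Step 2: glob to find all files that match the exclude pattern (if given).
    excluded = set(glob.glob(exclude)) if exclude else set()
    # Step 3: drop the excluded files from the prefix matches.
    return sorted(matched - excluded)
```

For example, `resolve_prefix_files("reads/sampleA*", "reads/*unpaired*")` would keep every `sampleA` file under `reads/` except those with `unpaired` in the name.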
This was referenced Mar 1, 2024
This was referenced Mar 20, 2024
Here I modify `manysketch` to allow multiple input files for a single sketch and singleton sketching.

New Functions
- `detect_csv_type` - figure out which type of acceptable input csv we have
- `process_assembly_csv` - process standard `fromfile` csv (headers: `name,genome_filename,protein_filename`). This was originally part of `load_fasta_fromfile`.
- `process_reads_csv` - process new `reads` csv (headers: `name,read1,read2`)

This refactoring also makes it much simpler for us to add new types of CSVs that can be read.
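To illustrate the idea of dispatching on CSV type, here is a small Python sketch of header-based detection. It is not the implementation in this PR: the function name `guess_csv_type`, the returned labels, and the exact-match rule on the header row are all assumptions made for the example. Only the two header layouts come from the description above.

```python
import csv

# Header layouts taken from the PR description.
ASSEMBLY_HEADERS = ("name", "genome_filename", "protein_filename")
READS_HEADERS = ("name", "read1", "read2")


def guess_csv_type(handle):
    """Classify an open CSV by its header row.

    Hypothetical helper: returns 'assembly', 'reads', or 'unknown'.
    """
    reader = csv.reader(handle)
    # An empty file has no header row, so fall back to an empty tuple.
    header = tuple(col.strip() for col in next(reader, []))
    if header == ASSEMBLY_HEADERS:
        return "assembly"
    if header == READS_HEADERS:
        return "reads"
    return "unknown"
```

With detection separated from processing like this, supporting another CSV layout would mostly mean adding one more header tuple and one more processing function.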
Notes:
- With `--singleton`, the name entry in each row doesn't end up getting used, since we name from the records themselves.

1. Multiple input files for a single sketch
We can now sketch multiple filenames into the same sketch. This is currently only possible if you input a 'reads' csv, with format `name,read1,read2`, which is designed to add support for metagenomes with paired-end (PE) reads. Note that the `read2` column name must be present, but can be empty. Since this is designed for metagenome reads, we assume input files are DNA.

The original `sketch fromfile` input format, `name,genome_filename,protein_filename`, remains functional.

Future work could introduce additional ways to pass more than one filename per sketch.
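As a rough illustration of the 'reads' CSV rules above (the `read2` column must be present but may be empty), the following Python sketch collects the input files for each sketch name. It is not the PR's `process_reads_csv`: the function name `read_reads_csv` is hypothetical, and it assumes one row per sample name.

```python
import csv

REQUIRED_COLUMNS = {"name", "read1", "read2"}


def read_reads_csv(handle):
    """Map each sample name to the list of read files given in its row.

    Hypothetical helper; assumes each name appears in only one row.
    """
    reader = csv.DictReader(handle)
    # The read2 column must exist in the header even if its values are empty.
    if not REQUIRED_COLUMNS.issubset(reader.fieldnames or []):
        raise ValueError("reads CSV needs the columns name,read1,read2")
    samples = {}
    for row in reader:
        # Keep only non-empty paths, so an empty read2 value is simply skipped.
        files = [
            row[col].strip()
            for col in ("read1", "read2")
            if row[col] and row[col].strip()
        ]
        samples[row["name"]] = files
    return samples
```

Each resulting list holds the files that would be combined into one sketch for that name.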
2. Singleton sketching
With either input filetype, we can now pass `--singleton` and get one sketch per record, named from the record name.

Prefix-based sketching punted to #243