MRG: enable singleton sketching, facilitate reads-based sketching #184

Merged: 11 commits merged into main from ms-mgx on Feb 27, 2024

@bluegenes bluegenes commented Jan 8, 2024

Here I modify manysketch to allow:

  • sketching multiple input files into the same sketch
  • sketching singletons!

New Functions

  • detect_csv_type - figure out which type of acceptable input csv we have
  • process_assembly_csv - process standard fromfile csv (headers: name,genome_filename,protein_filename). This was originally part of load_fasta_fromfile.
  • process_reads_csv - process new reads csv (headers: name,read1,read2)

This refactoring also makes it much simpler for us to add new types of CSVs that can be read.
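As a rough illustration of the header-based dispatch described above, here is a minimal sketch. The `CsvType` enum and the function body are assumptions made for illustration, not the code merged in this PR; only the function name and the two header layouts come from the description.

```rust
// Hypothetical sketch: the enum and matching logic are illustrative only.
#[derive(Debug, PartialEq)]
enum CsvType {
    Assembly, // header: name,genome_filename,protein_filename
    Reads,    // header: name,read1,read2
    Unknown,  // anything else
}

/// Classify an input CSV by its header line.
fn detect_csv_type(header_line: &str) -> CsvType {
    // Split on commas and trim whitespace around each column name.
    let cols: Vec<&str> = header_line
        .trim()
        .split(',')
        .map(|c| c.trim())
        .collect();
    match cols.as_slice() {
        ["name", "genome_filename", "protein_filename"] => CsvType::Assembly,
        ["name", "read1", "read2"] => CsvType::Reads,
        _ => CsvType::Unknown,
    }
}
```

With this shape, supporting another CSV layout would mean adding one enum variant and one match arm, which is the kind of extension the refactoring is meant to make easy.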

Notes:

  • Parallelization is still per set of inputs that should yield signatures (one unit of work per input row). With --singleton, all records in a given file are sketched in series; the per-record sketches are not parallelized.
  • One potentially confusing point: when sketching with --singleton, the name entry in each row is not used, since sketches are named from the records themselves.

1. Multiple input files for a single sketch

We can now sketch multiple files into the same sketch. This is currently only possible with a 'reads' csv, with format name,read1,read2, which is designed to support metagenomes with paired-end (PE) reads. The read2 column must be present in the header, but its value can be empty. Since this format is designed for metagenome reads, we assume the input files are DNA.

The original sketch fromfile input format, name,genome_filename,protein_filename, remains functional.
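To make the reads csv layout concrete, here is a minimal sketch of how one data row could be turned into a name plus a list of input files. The `ReadsRow` struct and `parse_reads_row` function are hypothetical names, the comma split is naive (no quoted fields), and treating an empty read1 as an error is an assumption rather than documented behavior.

```rust
/// One row of a reads csv: a sketch name plus the files that feed it.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
struct ReadsRow {
    name: String,
    filenames: Vec<String>,
}

/// Parse one data row with columns name,read1,read2.
/// Returns None unless the row has exactly three fields with a
/// non-empty name and read1 (assumption). read2 may be empty.
fn parse_reads_row(line: &str) -> Option<ReadsRow> {
    // Naive split: assumes no commas or quotes inside fields.
    let fields: Vec<&str> = line.trim().split(',').map(|f| f.trim()).collect();
    if fields.len() != 3 || fields[0].is_empty() || fields[1].is_empty() {
        return None;
    }
    let mut filenames = vec![fields[1].to_string()];
    if !fields[2].is_empty() {
        // Paired-end input: both files contribute to the same sketch.
        filenames.push(fields[2].to_string());
    }
    Some(ReadsRow {
        name: fields[0].to_string(),
        filenames,
    })
}
```

Under these assumptions, a row such as `mg1,mg1_R1.fq.gz,mg1_R2.fq.gz` yields two filenames for one sketch, while `mg2,mg2.fq.gz,` (empty read2) yields one.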

Future work could introduce additional ways to pass more than one filename per sketch.

2. Singleton sketching

With either input filetype, we can now pass --singleton and get one sketch per record, named from the record name.
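The naming difference can be sketched as follows. This is a simplified illustration, not the plugin's code: `fasta_records` and `sketch_names` are hypothetical helpers, the FASTA parsing is hand-rolled for plain text, and using the full header line as the record name is an assumption.

```rust
/// Split FASTA text into (record_name, sequence) pairs.
/// Assumption: the record name is the full header line after '>'.
fn fasta_records(text: &str) -> Vec<(String, String)> {
    let mut records: Vec<(String, String)> = Vec::new();
    for line in text.lines() {
        let line = line.trim();
        if let Some(header) = line.strip_prefix('>') {
            // A header line starts a new record.
            records.push((header.to_string(), String::new()));
        } else if let Some(last) = records.last_mut() {
            // Sequence lines are appended to the current record.
            last.1.push_str(line);
        }
    }
    records
}

/// Names of the sketches one input row would yield.
fn sketch_names(row_name: &str, fasta_text: &str, singleton: bool) -> Vec<String> {
    if singleton {
        // One sketch per record, named from the record; row_name is unused.
        fasta_records(fasta_text)
            .into_iter()
            .map(|(name, _seq)| name)
            .collect()
    } else {
        // One sketch for the whole input, named from the csv row.
        vec![row_name.to_string()]
    }
}
```

For a file with records `seqA` and `seqB` in a row named `genome1`, this gives two names with `singleton = true` and the single name `genome1` otherwise.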

Prefix-based sketching punted to #243

@bluegenes bluegenes changed the title manysketch improvements WIP: manysketch improvements Jan 9, 2024
@bluegenes bluegenes changed the title WIP: manysketch improvements MRG: enable singleton sketching, facilitate reads-based sketching Feb 27, 2024
@bluegenes

@ctb ready for review.

@ctb ctb left a comment

Looks great, works great!

Some minimal documentation and an update to the help docstring would be lovely :). But punt to an issue if you prefer to add those later.

@bluegenes bluegenes merged commit 9d9661b into main Feb 27, 2024
1 check passed
@bluegenes bluegenes deleted the ms-mgx branch February 27, 2024 17:57
@ctb ctb mentioned this pull request Feb 27, 2024
bluegenes added a commit that referenced this pull request Mar 1, 2024
## Prefix-based sketching

#184 introduces a new input type that better supports metagenome reads, but it doesn't make things much simpler for the power user. We can probably support prefix-style naming, as suggested in dib-lab/sourmash-slainte#11.

Here we introduce a 'prefix' CSV type with the following columns:
`name,input_moltype,prefix,exclude`.

To resolve the input files for each row, we:
1. glob to find all files that match `prefix`
2. glob to find all files that match `exclude`
3. remove the files matched by `exclude` from the files matched by `prefix`

This uses `glob` only, not `regex`, so `*` wildcards work in `prefix` and `exclude`, but full regex patterns do not.
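The three steps above can be sketched on an in-memory list of filenames. This is an illustration under stated assumptions: `wildcard_match` is a hand-rolled stand-in that supports only `*` (the commit itself uses `glob` against the filesystem), `select_files` is a hypothetical helper, and the `prefix` pattern is assumed to carry its own `*`.

```rust
/// Match `text` against a pattern in which `*` matches any run of
/// characters (including none). Stand-in for glob matching; no `?` or `[...]`.
fn wildcard_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    // Index of the most recent `*` and the text index it was tried at.
    let mut star: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            star = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((sp, st)) = star {
            // Backtrack: let the last `*` absorb one more character.
            pi = sp + 1;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            return false;
        }
    }
    // Any pattern characters left over must all be `*`.
    p[pi..].iter().all(|&c| c == '*')
}

/// Keep files matching `prefix`, then drop those matching `exclude`.
/// An empty `exclude` pattern drops nothing.
fn select_files<'a>(files: &[&'a str], prefix: &str, exclude: &str) -> Vec<&'a str> {
    files
        .iter()
        .copied()
        .filter(|f| wildcard_match(prefix, f))
        .filter(|f| !wildcard_match(exclude, f))
        .collect()
}
```

For example, with files `sample1_R1.fq`, `sample1_R2.fq`, and `sample1.unpaired.fq`, a `prefix` of `sample1*` and an `exclude` of `*unpaired*` would keep only the two paired files.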