upgrade manysketch to support multiple files per sketch? #169

Closed
ctb opened this issue Dec 24, 2023 · 3 comments
Comments

@ctb
Collaborator

ctb commented Dec 24, 2023

I am playing around with ways to more easily support sketching multiple data files into one sketch, over in slainte - dib-lab/sourmash-slainte#2 - and it would be nice to have manysketch support something similar.

Hmm, an intriguing alternative would be to have manysketch remain "simple" - turn single data files into sketches - but then support more robust combining of sketches after the initial sketching, either in slainte or in sourmash.

@ctb
Collaborator Author

ctb commented Jan 2, 2024

dib-lab/sourmash-slainte#7 was completed using sig merge, which is nice but overly complex.

Maybe we could support a JSON file => manysketch?

@bluegenes
Contributor

  • MRG: enable singleton sketching, facilitate reads-based sketching #184 introduced multiple files per sketch and multiple sketches per file (via singleton).

    • With this PR, the only way to use multiple files per sketch was the reads csv input format, which takes the columns name,read1,read2 and sketches both the read1 and read2 fasta files into the same sketch. However, the refactoring done here made it much simpler to add new input csv types.
  • MRG: fix clippy warnings from manysketch improvements #251 simplified a little by introducing a FastaData struct to hold the information for sketching.

    pub struct FastaData {
        pub name: String,
        pub paths: Vec<PathBuf>,
        pub input_type: String,
    }
  • MRG: support prefix csv input for manysketch #243 introduced a new input csv type, prefix, which requires columns: name,input_moltype,prefix,exclude, where we use glob to find files that match prefix but do not match exclude.

    • This is a much more flexible way to specify several input files per sketch. We track the FASTA files used across all rows, and, by default, error out if a file is used more than once. --force allows us to ignore duplication.
  • Other information of note:

    • All csv reader functions ignore duplicated rows (rows where all information is duplicated) and notify the user of the number of duplicate rows seen.
    • singleton logic is used during sketching (i.e. after reading from csv). If --singleton is enabled, we produce one sketch per record in each fasta, regardless of how they were read in. This means that the name from the input csv is not currently used. We could think about prepending the name or something, I don't know what would be desirable.
    • --force is not yet used when reading fromfile or reads csvs.
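As a rough illustration of the duplicate-file check described for the prefix csv, here is a minimal sketch using only the standard library. The helper names `find_reused_paths` and `check_file_reuse` and the error message are hypothetical, not the plugin's actual code; only the `FastaData` struct and the `--force` behavior come from the notes above.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Same fields as the FastaData struct quoted above.
pub struct FastaData {
    pub name: String,
    pub paths: Vec<PathBuf>,
    pub input_type: String,
}

/// Hypothetical helper: list every path that is used more than once
/// across all rows, in the order the reuse is first noticed.
pub fn find_reused_paths(rows: &[FastaData]) -> Vec<PathBuf> {
    let mut seen: HashSet<&PathBuf> = HashSet::new();
    let mut reused: Vec<PathBuf> = Vec::new();
    for row in rows {
        for path in &row.paths {
            // insert() returns false when the path was already recorded.
            if !seen.insert(path) && !reused.contains(path) {
                reused.push(path.clone());
            }
        }
    }
    reused
}

/// Hypothetical check: error out on reused files unless `force` is set.
/// On success, returns the number of reused files that were tolerated.
pub fn check_file_reuse(rows: &[FastaData], force: bool) -> Result<usize, String> {
    let reused = find_reused_paths(rows);
    if !reused.is_empty() && !force {
        return Err(format!(
            "{} file(s) used in more than one sketch; --force would allow this",
            reused.len()
        ));
    }
    Ok(reused.len())
}
```

With `force` set to `false`, two rows that both list the same FASTA path produce an `Err`; with `force` set to `true`, the same input is accepted and the count of reused files is returned.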

I think this addresses everything here? We can now (somewhat easily) introduce more ways of reading in multiple files: the csv has to have a defined set of required column names, and the reader function needs to return a vec of FastaData and a total number of fastas, which is used for progress reporting.
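To make that contract concrete, here is a minimal sketch of what a new reader function could look like, using only the standard library. Everything specific in it is an assumption for illustration: the csv type with columns `name,input_moltype,paths` (with `;`-separated paths), the function name `read_paths_csv`, the naive comma splitting, and the mapping of `input_moltype` onto `input_type` are all hypothetical and not part of the plugin.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Same fields as the FastaData struct quoted above.
#[derive(Debug, PartialEq)]
pub struct FastaData {
    pub name: String,
    pub paths: Vec<PathBuf>,
    pub input_type: String,
}

/// Required columns for this made-up csv type (an assumption).
const REQUIRED: [&str; 3] = ["name", "input_moltype", "paths"];

/// Hypothetical reader: returns the rows to sketch plus the total number
/// of FASTA files, which a caller could use for progress reporting.
pub fn read_paths_csv(text: &str) -> Result<(Vec<FastaData>, usize), String> {
    let mut lines = text.lines();
    let header: Vec<&str> = lines
        .next()
        .ok_or_else(|| "empty csv".to_string())?
        .split(',')
        .map(str::trim)
        .collect();
    if header != REQUIRED {
        return Err(format!("expected columns {:?}, found {:?}", REQUIRED, header));
    }

    let mut seen_rows: HashSet<&str> = HashSet::new();
    let mut duplicate_rows = 0;
    let mut n_fastas = 0;
    let mut results = Vec::new();
    for line in lines {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        // Skip rows that are exact duplicates of an earlier row.
        if !seen_rows.insert(line) {
            duplicate_rows += 1;
            continue;
        }
        // Naive split: this sketch does not handle quoted fields.
        let fields: Vec<&str> = line.split(',').collect();
        if fields.len() != REQUIRED.len() {
            return Err(format!("malformed row: {}", line));
        }
        let paths: Vec<PathBuf> = fields[2]
            .split(';')
            .filter(|p| !p.is_empty())
            .map(PathBuf::from)
            .collect();
        n_fastas += paths.len();
        results.push(FastaData {
            name: fields[0].to_string(),
            paths,
            input_type: fields[1].to_string(),
        });
    }
    if duplicate_rows > 0 {
        eprintln!("skipped {} duplicated row(s)", duplicate_rows);
    }
    Ok((results, n_fastas))
}
```

A real reader would presumably use a proper csv parser rather than splitting on commas; the point here is only the shape of the return value and the handling of required columns and duplicated rows.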

@ctb
Collaborator Author

ctb commented Mar 1, 2024

Nice work!!!
