copy batched zips to a single zipfile? #107

Open
bluegenes opened this issue Oct 1, 2024 · 1 comment

Comments

@bluegenes
Collaborator

An async version of `sig cat` might offer some speed/performance benefit. Now that #102 has introduced batched zips, we could offer a `zipcat` or similar command to copy all batched zips into a single zip.

With #102, we have a number of utils that would simplify this:

  • `find_existing_zip_batches` function
  • super basic impl of `MultiCollection` that can load zipfiles
  • `async_write_sigs_to_zip` in `BuildCollection`
  • extendable `BuildManifest` with `async_write_manifest_to_zip`
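
For discussion, here is a rough, synchronous sketch of what a `zipcat` could do. It uses Python's standard `zipfile` module, not the Rust utilities listed above. The names `find_batches`, `zipcat`, and the `MANIFEST_NAME` value are assumptions made for illustration and are not part of #102. The sketch skips per-batch manifests instead of merging them, so the combined zip would still need a manifest built afterwards.

```python
import re
import zipfile
from pathlib import Path

# Assumed name of the per-zip manifest entry; adjust if the real name differs.
MANIFEST_NAME = "SOURMASH-MANIFEST.csv"


def find_batches(output: Path) -> list[Path]:
    """Return `<stem>.N.zip` files next to `output`, ordered by N (hypothetical helper)."""
    pattern = re.compile(rf"^{re.escape(output.stem)}\.(\d+)\.zip$")
    found = []
    for path in output.parent.iterdir():
        match = pattern.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Sort numerically so output.10.zip comes after output.2.zip.
    return [path for _, path in sorted(found)]


def zipcat(batches: list[Path], combined: Path) -> int:
    """Copy every non-manifest entry from `batches` into `combined`.

    Returns the number of entries written. Entries whose name was already
    copied from an earlier batch are skipped.
    """
    seen: set[str] = set()
    with zipfile.ZipFile(combined, "w", compression=zipfile.ZIP_STORED) as out:
        for batch in batches:
            with zipfile.ZipFile(batch) as src:
                for info in src.infolist():
                    name = info.filename
                    if info.is_dir() or name == MANIFEST_NAME or name in seen:
                        continue
                    out.writestr(name, src.read(name))
                    seen.add(name)
    return len(seen)
```

Calling `zipcat(find_batches(Path("output.zip")), Path("output.zip"))` would gather `output.1.zip`, `output.2.zip`, etc. into `output.zip`. Entries are stored uncompressed here on the assumption that signature files are already gzipped.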
bluegenes added a commit that referenced this issue Oct 4, 2024
This PR introduces a new optional param `--batch-size`, which allows
users to build smaller zipfiles with `gbsketch` or `urlsketch`. These
zipfiles are populated sequentially, with all signatures associated with
`batch_size` accessions (not `batch_size` signatures). If
`gbsketch`/`urlsketch` fail, they can read any zipfiles that were
finished in order to restart. Zip names will be generated from the
`--output`, so if output is `output.zip`, batches will be
`output.1.zip`, `output.2.zip`, etc. I'm not really sure what
`batch_size` to recommend, but I think the overhead is fairly low for
creating new small zips -- the main issue will be if users later want to
concatenate them into a single zip.
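
As an illustration of the naming scheme only, a small Python sketch follows. The function `batch_zip_names` is hypothetical, and it assumes the number of accessions is known up front, which may not match how the Rust code actually assigns names while writing. Treating a count equal to `batch_size` as a single unbatched zip is also an assumption.

```python
from pathlib import Path


def batch_zip_names(output: str, n_accessions: int, batch_size: int) -> list[str]:
    """Return the zip names expected for `n_accessions` split by `batch_size`."""
    out = Path(output)
    # Everything fits in one batch (or batching is off): keep the plain name.
    if batch_size <= 0 or n_accessions <= batch_size:
        return [str(out)]
    n_batches = -(-n_accessions // batch_size)  # ceiling division
    return [
        str(out.with_name(f"{out.stem}.{i}{out.suffix}"))
        for i in range(1, n_batches + 1)
    ]
```

For example, `batch_zip_names("output.zip", 5, 2)` gives `output.1.zip`, `output.2.zip`, and `output.3.zip`.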

Uses the changes from #101 to enable writing batched zipfiles as a way
to improve restart.

- [x] make batch_size a user modifiable parameter
- [x] For cases where the total number of signatures is less than the
`batch_size`, we could write the regular `*zip` file, with no `.1`, etc.
- [x] functions to enable reading from existing batched zips to allow
restart
- [x] build filename: paramset Hashmap, use that to filter the template
sigs for each filename using `filter`
- [x] add tests for batched zipfile writing, recovery from existing
batches
- [x] move zip_writer creation inside writing loop to avoid empty final
zip
- [x] check what happens if we have an unclosed zip (i.e. from
unexpected failure)
  - **sourmash panics on invalid zips. Here I've caught the panic and
  ignored it**, but it may ultimately be better to handle + return error
  at the sourmash level (`ZipStorage::from_file` panics)
  - Note that we will likely have an invalid zip upon any restart from
  failure, because the zip file would not have properly been
  closed/finished.
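
To illustrate the "skip unreadable batches on restart" idea outside of Rust, here is a minimal Python analogue. It relies on the standard `zipfile` module raising `zipfile.BadZipFile` for a zip that was never finished, and it is not how `ZipStorage::from_file` behaves. The helper name `readable_batches` is hypothetical.

```python
import zipfile
from pathlib import Path


def readable_batches(paths):
    """Split `paths` into (readable, skipped) lists of zip files.

    A zip that was not closed properly has no end-of-central-directory
    record, so opening it raises BadZipFile; such files are skipped so
    their accessions can be redone on restart.
    """
    readable, skipped = [], []
    for path in map(Path, paths):
        try:
            with zipfile.ZipFile(path) as zf:
                zf.namelist()  # touch the central directory
            readable.append(path)
        except zipfile.BadZipFile:
            skipped.append(path)
    return readable, skipped
```

Returning the skipped paths, instead of silently ignoring them, leaves room to report or delete the broken batch before it is rewritten.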


Issue for later: 
- #107 

Fixes:
- #69
@bluegenes
Collaborator Author

Or we could just provide a command for `sig collect` to create a manifest from the batched zips.
