copy batched zips to a single zipfile? #107
bluegenes added a commit that referenced this issue on Oct 4, 2024
This PR introduces a new optional param `--batch-size`, which allows users to build smaller zipfiles with `gbsketch` or `urlsketch`. These zipfiles are populated sequentially, with all signatures associated with `batch_size` accessions (not `batch_size` signatures). If `gbsketch`/`urlsketch` fail, they can read any zipfiles that were finished in order to restart.

Zip names will be generated from the `--output`, so if output is `output.zip`, batches will be `output.1.zip`, `output.2.zip`, etc.

I'm not really sure what `batch_size` to recommend, but I think the overhead is fairly low for creating new small zips -- the main issue will be if users later want to concatenate them into a single zip.

Uses the changes from #101 to enable writing batched zipfiles as a way to improve restart.

- [x] make batch_size a user modifiable parameter
- [x] For cases where the total number of signatures is less than the `batch_size`, we could write the regular `*zip` file, with no `.1`, etc.
- [x] functions to enable reading from existing batched zips to allow restart
- [x] build filename: paramset Hashmap, use that to filter the template sigs for each filename using `filter`
- [x] add tests for batched zipfile writing, recovery from existing batches
- [x] move zip_writer creation inside writing loop to avoid empty final zip
- [x] check what happens if we have an unclosed zip (i.e. from unexpected failure)
  - **sourmash panics on invalid zips. Here I've caught the panic and ignored it**, but it may ultimately be better to handle + return error at the sourmash level (`ZipStorage::from_file` panics)
  - Note that we will likely have an invalid zip upon any restart from failure, because the zip file would not have properly been closed/finished.

Issue for later:
- #107

Fixes:
- #69
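To make the naming and batching rule concrete, here is a minimal Python sketch. It is not the plugin's Rust implementation: `batch_zip_name` and `plan_batches` are hypothetical helpers, and the single-batch fallback to the plain `--output` name is one reading of the checklist item above.

```python
from pathlib import Path


def batch_zip_name(output: str, batch_index: int) -> str:
    """Insert the batch number before the extension: output.zip -> output.1.zip."""
    p = Path(output)
    return str(p.with_name(f"{p.stem}.{batch_index}{p.suffix}"))


def plan_batches(accessions: list[str], batch_size: int, output: str) -> dict[str, list[str]]:
    """Map each zip filename to the accessions whose signatures it would hold.

    Batches are counted in accessions, not signatures, so every signature
    for a given accession lands in the same zip.
    """
    chunks = [
        accessions[i:i + batch_size]
        for i in range(0, len(accessions), batch_size)
    ]
    # Assumption: when everything fits in one batch, keep the plain output
    # name with no ".1" suffix.
    if len(chunks) <= 1:
        return {output: list(accessions)}
    return {
        batch_zip_name(output, n): chunk
        for n, chunk in enumerate(chunks, start=1)
    }
```

For example, three accessions with a batch size of 2 and `output.zip` would be split across `output.1.zip` (two accessions) and `output.2.zip` (one accession).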
or could just provide command for |
There may be some speed/performance benefit to an async version of `sig cat`? After introducing batched zip with #102, we could offer a `zipcat` or similar to copy all batched zips into a single zip.

With #102, we have a number of utils that would simplify this:

- `find_existing_zip_batches` function
- `MultiCollection` that can load zipfiles
- `async_write_sigs_to_zip` in `BuildCollection`
- `BuildManifest` with `async_write_manifest_to_zip`
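
As a rough illustration of what such a command would do, the sketch below uses only Python's standard `zipfile` module. It is synchronous and does not use any of the Rust utilities listed above. `find_batches` and `zipcat` are hypothetical names, and the manifest filename is an assumption about the zip layout. The sketch skips per-batch manifests instead of merging them, so a real tool would still need to rebuild a combined manifest.

```python
import re
import zipfile
from pathlib import Path

# Assumption: each batch carries its own manifest under this name.
MANIFEST_NAME = "SOURMASH-MANIFEST.csv"


def find_batches(output: Path) -> list[Path]:
    """Return output.1.zip, output.2.zip, ... sorted by batch number."""
    pattern = re.compile(
        rf"^{re.escape(output.stem)}\.(\d+){re.escape(output.suffix)}$"
    )
    found = []
    for candidate in output.parent.iterdir():
        match = pattern.match(candidate.name)
        if match:
            found.append((int(match.group(1)), candidate))
    return [path for _, path in sorted(found)]


def zipcat(output: Path) -> int:
    """Copy every non-manifest entry from the batch zips into `output`.

    Returns the number of entries copied. Files that are not valid zips
    (for example, a batch left unclosed by a failed run) are skipped.
    """
    copied = 0
    seen: set[str] = set()
    with zipfile.ZipFile(output, "w", zipfile.ZIP_STORED) as dest:
        for batch in find_batches(output):
            if not zipfile.is_zipfile(batch):
                continue
            with zipfile.ZipFile(batch) as src:
                for info in src.infolist():
                    name = info.filename
                    if info.is_dir() or name == MANIFEST_NAME or name in seen:
                        continue
                    dest.writestr(name, src.read(name))
                    seen.add(name)
                    copied += 1
    return copied
```

Calling `zipcat(Path("output.zip"))` in a directory containing `output.1.zip` and `output.2.zip` would write their signature entries into `output.zip`. Entries are stored without recompression, on the assumption that the signature files are already compressed.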