MRG: improve restart by optionally writing batched zipfiles (#102)
This PR introduces a new optional parameter, `--batch-size`, which allows users to build smaller zipfiles with `gbsketch` or `urlsketch`. These zipfiles are populated sequentially, and each one holds all signatures associated with `batch_size` accessions (not `batch_size` signatures). If `gbsketch` or `urlsketch` fails, it can read any zipfiles that were finished in order to restart.

Zip names are generated from `--output`, so if the output is `output.zip`, the batches will be `output.1.zip`, `output.2.zip`, etc.

I'm not really sure what `batch_size` to recommend, but I think the overhead of creating new small zips is fairly low; the main issue will be if users later want to concatenate them into a single zip.

Uses the changes from #101 to enable writing batched zipfiles as a way to improve restart.

- [x] make `batch_size` a user-modifiable parameter
- [x] for cases where the total number of signatures is less than the `batch_size`, write the regular `*zip` file, with no `.1`, etc.
- [x] add functions that read from existing batched zips to allow restart
- [x] build filename: paramset HashMap, and use that to filter the template sigs for each filename using `filter`
- [x] add tests for batched zipfile writing and recovery from existing batches
- [x] move `zip_writer` creation inside the writing loop to avoid an empty final zip
- [x] check what happens if we have an unclosed zip (i.e. from an unexpected failure)
  - **sourmash panics on invalid zips. Here I've caught the panic and ignored it**, but it may ultimately be better to handle it and return an error at the sourmash level (`ZipStorage::from_file` panics)
  - Note that we will likely have an invalid zip upon any restart from failure, because the zip file would not have been properly closed/finished.

Issue for later:
- #107

Fixes:
- #69
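The naming scheme and restart behaviour described above can be sketched in a few lines. This is only an illustration in Python using the standard `zipfile` module, not the plugin's actual Rust code: the helper names `batch_path` and `completed_entries` are hypothetical, the sketch assumes `--output` ends in `.zip`, and `zipfile.BadZipFile` stands in for the panic raised by `ZipStorage::from_file` on an unfinished zip. It does not cover the case where a single unnumbered zip is written because the total is below `batch_size`.

```python
import zipfile
from pathlib import Path


def batch_path(output: str, index: int) -> Path:
    """Hypothetical helper: output.zip + 1 -> output.1.zip."""
    out = Path(output)
    return out.with_name(f"{out.stem}.{index}{out.suffix}")


def completed_entries(output: str) -> tuple[set[str], list[Path]]:
    """Hypothetical helper: scan existing batch zips next to `output`.

    Returns the member names found in readable batches, plus the list of
    batch files that were skipped because they are not valid zips
    (for example, a batch left unclosed by a crash).
    """
    out = Path(output)
    seen: set[str] = set()
    skipped: list[Path] = []
    for path in sorted(out.parent.glob(f"{out.stem}.*{out.suffix}")):
        # Keep only names of the form <stem>.<number><suffix>.
        middle = path.name.removeprefix(out.stem + ".").removesuffix(out.suffix)
        if not middle.isdigit():
            continue
        try:
            with zipfile.ZipFile(path) as zf:
                seen.update(zf.namelist())
        except zipfile.BadZipFile:
            # Assumption: an unreadable batch is ignored and its contents
            # are simply regenerated on restart.
            skipped.append(path)
    return seen, skipped
```

A restart would then skip any work whose output is already listed in the returned set and regenerate everything else, including whatever was in the skipped batches.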