improve recovery after failures #69
related: sourmash-bio/database-releases#7
bluegenes changed the title from "improve recovery after failures by writing sigs to temp dir" to "improve recovery after failures" on Oct 4, 2024
7 tasks
bluegenes added a commit that referenced this issue on Oct 4, 2024
This PR introduces a new optional param `--batch-size`, which allows users to build smaller zipfiles with `gbsketch` or `urlsketch`. These zipfiles are populated sequentially, with all signatures associated with `batch_size` accessions (not `batch_size` signatures). If `gbsketch`/`urlsketch` fail, they can read any zipfiles that were finished in order to restart. Zip names will be generated from the `--output`, so if output is `output.zip`, batches will be `output.1.zip`, `output.2.zip`, etc.

I'm not really sure what `batch_size` to recommend, but I think the overhead is fairly low for creating new small zips -- the main issue will be if users later want to concatenate them into a single zip.

Uses the changes from #101 to enable writing batched zipfiles as a way to improve restart.

- [x] make batch_size a user modifiable parameter
- [x] For cases where the total number of signatures is less than the `batch_size`, we could write the regular `*zip` file, with no `.1`, etc.
- [x] functions to enable reading from existing batched zips to allow restart
- [x] build filename: paramset Hashmap, use that to filter the template sigs for each filename using `filter`
- [x] add tests for batched zipfile writing, recovery from existing batches
- [x] move zip_writer creation inside writing loop to avoid empty final zip
- [x] check what happens if we have an unclosed zip (i.e. from unexpected failure)
  - **sourmash panics on invalid zips. Here I've caught the panic and ignored it**, but it may ultimately be better to handle + return error at the sourmash level (`ZipStorage::from_file` panics)
  - Note that we will likely have an invalid zip upon any restart from failure, because the zip file would not have properly been closed/finished.

Issue for later:
- #107

Fixes:
- #69
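To make the batch naming and restart behaviour described above concrete, here is a minimal Python sketch. It is not the plugin's code (directsketch itself is written in Rust), and the function names `batch_path` and `completed_entries` are hypothetical. It assumes `--output` ends in an extension such as `.zip`, and it assumes an unfinished batch can simply be skipped and later overwritten, which the PR description does not confirm.

```python
import re
import zipfile
from pathlib import Path


def batch_path(output: str, index: int) -> Path:
    """Derive a batch filename from the output name: output.zip -> output.1.zip."""
    out = Path(output)
    return out.with_name(f"{out.stem}.{index}{out.suffix}")


def completed_entries(output: str) -> tuple[set[str], int]:
    """Scan existing batch zips next to `output`.

    Returns the entry names found in readable batches and the index to use
    for the next batch. Batches that cannot be opened (for example, a zip
    left unclosed by a crash) are ignored.
    """
    out = Path(output)
    pattern = re.compile(
        rf"^{re.escape(out.stem)}\.(\d+){re.escape(out.suffix)}$"
    )
    seen: set[str] = set()
    highest = 0
    for path in out.parent.iterdir():
        match = pattern.match(path.name)
        if match is None:
            continue  # not one of our batch files
        try:
            with zipfile.ZipFile(path) as zf:
                seen.update(zf.namelist())
        except zipfile.BadZipFile:
            continue  # unfinished batch: skip it (assumed policy)
        highest = max(highest, int(match.group(1)))
    return seen, highest + 1
```

On restart, a caller would compare the returned entry names against the signatures it expects for each accession and only download and sketch what is missing, writing new results to `batch_path(output, next_index)`.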
fixed by #102
Currently, if `directsketch` fails for whatever reason during download+sketch, already-sketched files are unusable, because they're part of an unfinished zip file. However, we're not actually using `zip` for any compression here -- sigs are gz compressed themselves and then just stored in the zip.

Instead of writing directly to a zip file, we could write sigs to a temp directory (provide `--temp-dir` option for naming?), which would be readable upon any failure. We could optionally write manifests in chunks to make loading simpler. After sketching, we could move the files into a zip, combine the manifests, and finish the zip file. I'm not sure how much extra time this last bit would take, but likely worth it to allow recovery.

For recovery after failure / use of temp sketches, we would first look in the `--temp-dir` for any preexisting sketches and just avoid re-calculating those.
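The temp-directory idea proposed here could look roughly like the Python sketch below. It is only an illustration under stated assumptions: `sketch_to_temp`, `finish_zip`, the `sketch_fn` callback, and the `<accession>.sig.gz` naming are all hypothetical, the write-then-rename step is an added safeguard not mentioned in the issue, and manifest handling is left out entirely.

```python
import gzip
import zipfile
from pathlib import Path


def sketch_to_temp(accessions, temp_dir, sketch_fn):
    """Write one gzipped signature file per accession into temp_dir.

    Accessions that already have a file in temp_dir are skipped, so a rerun
    after a failure only computes the missing sketches.
    `sketch_fn(accession) -> bytes` stands in for download + sketch.
    """
    temp = Path(temp_dir)
    temp.mkdir(parents=True, exist_ok=True)
    for acc in accessions:
        final = temp / f"{acc}.sig.gz"
        if final.exists():
            continue  # already sketched in an earlier run
        partial = final.with_name(final.name + ".part")
        with gzip.open(partial, "wb") as fh:
            fh.write(sketch_fn(acc))
        # Rename only after the write completes, so a crash mid-write
        # never leaves a truncated file under the final name.
        partial.replace(final)


def finish_zip(temp_dir, output):
    """Store the finished signature files in a zip without recompressing."""
    with zipfile.ZipFile(output, "w", compression=zipfile.ZIP_STORED) as zf:
        for sig in sorted(Path(temp_dir).glob("*.sig.gz")):
            zf.write(sig, arcname=sig.name)
```

`ZIP_STORED` mirrors the point made above that the sigs are already gz compressed and the zip is only a container. The final packing step reads every file once more, which is the extra cost the issue is unsure about.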