improve recovery after failures #69

Closed
bluegenes opened this issue Jul 15, 2024 · 3 comments

Comments

@bluegenes
Collaborator

Currently, if directsketch fails for any reason during download+sketch, the already-sketched files are unusable because they sit inside an unfinished zip file. However, we're not actually using zip for compression here: sigs are gzip-compressed themselves and then just stored in the zip.

Instead of writing directly to a zip file, we could write sigs to a temp directory (provide a --temp-dir option for naming?), which would remain readable after any failure. We could optionally write manifests in chunks to make loading simpler. After sketching, we could move the files into a zip, combine the manifests, and finish the zip file. I'm not sure how much extra time this last step would take, but it's likely worth it to allow recovery.

To recover after a failure (or to reuse temp sketches), we would first look in the --temp-dir for any preexisting sketches and skip recalculating those.
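A rough sketch of the temp-dir idea, in Python for brevity rather than the plugin's own code. The function names, the `.sig.gz` / `.partial` naming, the `signatures/` prefix, and the `sketch_fn` callback are all hypothetical; manifest handling is left out.

```python
import gzip
import zipfile
from pathlib import Path


def sketch_to_temp_dir(accessions, temp_dir, sketch_fn):
    """Write one gzipped signature file per accession, skipping finished ones.

    sketch_fn is a hypothetical callback returning signature bytes.
    """
    temp_dir = Path(temp_dir)
    temp_dir.mkdir(parents=True, exist_ok=True)
    for acc in accessions:
        final_path = temp_dir / f"{acc}.sig.gz"
        if final_path.exists():
            continue  # already sketched in a previous run
        partial = temp_dir / f"{acc}.sig.gz.partial"
        with gzip.open(partial, "wb") as fh:
            fh.write(sketch_fn(acc))
        # Rename only after the file is fully written, so a crash
        # never leaves a half-written file under the final name.
        partial.replace(final_path)


def pack_into_zip(temp_dir, zip_path):
    """Store the already-gzipped files in a zip without recompressing."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for sig in sorted(Path(temp_dir).glob("*.sig.gz")):
            zf.write(sig, arcname=f"signatures/{sig.name}")
```

Calling `sketch_to_temp_dir` a second time with the same directory only sketches the accessions that are still missing, which is the recovery behavior described above.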

@ctb
Contributor

ctb commented Aug 25, 2024

related: sourmash-bio/database-releases#7

@bluegenes changed the title from "improve recovery after failures by writing sigs to temp dir" to "improve recovery after failures" on Oct 4, 2024
@bluegenes
Collaborator Author

@ctb suggested an alternative: writing smaller zipfiles. #102 implements this.

bluegenes added a commit that referenced this issue Oct 4, 2024
This PR introduces a new optional param `--batch-size`, which allows
users to build smaller zipfiles with `gbsketch` or `urlsketch`. These
zipfiles are populated sequentially, each holding all signatures
associated with `batch_size` accessions (not `batch_size` signatures).
If `gbsketch`/`urlsketch` fails, it can read any finished zipfiles in
order to restart. Zip names are generated from `--output`, so if output
is `output.zip`, batches will be `output.1.zip`, `output.2.zip`, etc.
I'm not really sure what `batch_size` to recommend, but I think the
overhead of creating new small zips is fairly low. The main issue will
be if users later want to concatenate them into a single zip.
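
To make the naming and grouping concrete, here is a small Python sketch (not the PR's code). `batch_zip_name` and `batches` are hypothetical helpers, and the 1-based numbering is inferred from the `output.1.zip` example above.

```python
from pathlib import Path


def batch_zip_name(output, batch_index):
    """Insert the batch index before the final suffix: output.zip -> output.1.zip."""
    out = Path(output)
    return str(out.with_name(f"{out.stem}.{batch_index}{out.suffix}"))


def batches(accessions, batch_size):
    """Yield (1-based index, group) pairs with at most batch_size accessions each."""
    for start in range(0, len(accessions), batch_size):
        yield start // batch_size + 1, accessions[start:start + batch_size]
```

Batches are counted in accessions, so a batch can hold more than `batch_size` signatures when an accession yields several of them.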

Uses the changes from #101 to enable writing batched zipfiles as a way
to improve restart.

- [x] make batch_size a user modifiable parameter
- [x] For cases where the total number of signatures is less than the
`batch_size`, we could write the regular `*zip` file, with no `.1`, etc.
- [x] functions to enable reading from existing batched zips to allow
restart
- [x] build filename: paramset Hashmap, use that to filter the template
sigs for each filename using `filter`
- [x] add tests for batched zipfile writing, recovery from existing
batches
- [x] move zip_writer creation inside writing loop to avoid empty final
zip
- [x] check what happens if we have an unclosed zip (i.e. from
unexpected failure)
  - **sourmash panics on invalid zips. Here I've caught the panic and
    ignored it**, but it may ultimately be better to handle it and return
    an error at the sourmash level (`ZipStorage::from_file` panics)
  - Note that we will likely have an invalid zip upon any restart from
    failure, because the zip file would not have been properly
    closed/finished.
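
For illustration only, the restart check could look roughly like this in Python. This is an analogy, not the PR's Rust code: Python's `zipfile` raises `BadZipFile` on an unfinished zip where sourmash panics, and `read_finished_batches` is a hypothetical name.

```python
import zipfile
from pathlib import Path


def read_finished_batches(output):
    """Collect entry names from readable batch zips next to `output`.

    Returns (entry names found, paths of zips that could not be read).
    """
    out = Path(output)
    done = set()
    skipped = []
    for path in sorted(out.parent.glob(f"{out.stem}.*{out.suffix}")):
        # Keep only names of the form <stem>.<number><suffix>.
        index = path.name[len(out.stem) + 1 : len(path.name) - len(out.suffix)]
        if not index.isdigit():
            continue
        try:
            with zipfile.ZipFile(path) as zf:
                done.update(zf.namelist())
        except zipfile.BadZipFile:
            # Most likely the zip being written when the run failed.
            skipped.append(path)
    return done, skipped
```

Anything in `done` can be skipped on restart; the contents of a skipped zip would need to be regenerated.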


Issue for later: 
- #107 

Fixes:
- #69
@bluegenes
Collaborator Author

fixed by #102
