Selectively re-run extractions feature #1071
The whole idea of re-running extractors is to ensure we do extract the data from the classification per the extractor config, and thus that the downstream reducers re-run as well. I can see there being a need to avoid re-running reducers etc., but avoiding a re-run / call to the external service is not what this service was designed to do. The code could reflect on a workflow configuration option and not re-run the extractor (or downstream reductions etc.), but it would be very low priority for feature development on Caesar.
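For illustration only, a rough sketch of what that config-gated skip could look like. Caesar is a Ruby/Rails app, so this Python is just a shape, and the `rerun_exempt_extractors` configuration key and function names are invented here:

```python
# Hypothetical sketch: the "rerun_exempt_extractors" workflow configuration
# key and these function names are illustrative, not part of Caesar.

def rerun_extractors(workflow_config, extractors, classification):
    """Re-run extractors for one classification, skipping any the workflow
    configuration marks as exempt (e.g. side-effecting external extractors)."""
    exempt = set(workflow_config.get("rerun_exempt_extractors", []))
    for key, extract in extractors.items():
        if key in exempt:
            continue  # leave this extraction (and its downstream reductions) alone
        extract(classification)
```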
I figured that would be the case. It sounds like, while it works, I'd be better off moving this out of Caesar, since what we're actually doing is generating classification side-effects and breaking idempotence as a result. Having another app consuming the Kinesis stream might be a better idea, if we can come up with cross-account access to it; I don't want to put stuff on Zoo infrastructure unnecessarily.
Or an alternative setup where you use Caesar as your event publisher for classification data. You then build an API gateway style system backed by a Redis store (for atomic transactions, https://redis.io/topics/transactions) and you only propagate the classification event if you have no record of the classification id in your store. This would avoid you building a coupled (Kinesis) stream reader as well, and let you focus your idempotency logic on replayed events from the publisher (Caesar). Thoughts?
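A minimal sketch of that dedup check, assuming a redis-py client and classification payloads arriving from Caesar as the event publisher; the `seen:` key prefix and the `forward_downstream` helper are placeholder names, not anything Caesar provides:

```python
import redis

r = redis.Redis()

def forward_downstream(classification):
    ...  # e.g. call the payment API, enqueue a job, etc.

def propagate_once(classification):
    key = f"seen:classification:{classification['id']}"
    # SET NX is atomic, so even concurrent deliveries of the same
    # classification can only claim the key once; replays fall through
    # to the early return instead of reaching the downstream service.
    first_delivery = r.set(key, 1, nx=True, ex=60 * 60 * 24 * 90)
    if not first_delivery:
        return  # already propagated this classification id
    forward_downstream(classification)
```

A single `SET NX` is enough here instead of a full `MULTI`/`EXEC` transaction, since the check-and-record step collapses into one atomic command.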
As for Zoo infrastructure, short term this isn't an issue and I don't think the cost here is very high; longer term it can be revisited.
Related to #1070: adding the ability to selectively re-run extractors might be useful, as might a way to prevent a given extractor from being re-run when "Re-run extractors" is hit.
Use case: I'm using an external extractor to hit an API that requests that a cloud platform pay its workers on each successful classification. If I had another extractor for use in aggregation / retirement and needed to re-run it (for whatever reason), re-running everything would presumably spam my payment API as well?
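For context, the kind of guard I'd otherwise have to build into the external extractor might look like the sketch below. The payout endpoint is hypothetical, and its acceptance of an `Idempotency-Key` header is an assumption (it's the pattern providers like Stripe support), shown only to illustrate how a repeated extraction run could be made harmless rather than a double payment:

```python
import requests

PAYMENT_API_URL = "https://payments.example.com/v1/payouts"  # placeholder

def request_payment(classification):
    response = requests.post(
        PAYMENT_API_URL,
        json={"worker_id": classification["user_id"], "amount_cents": 50},
        # Keying the request on the classification id means re-running the
        # extractor for the same classification cannot trigger a second payout,
        # provided the provider deduplicates on this header.
        headers={"Idempotency-Key": f"classification-{classification['id']}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```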