Selectively re-run extractions feature #1071

Open · rogerhutchings opened this issue Jan 21, 2020 · 4 comments

@rogerhutchings commented Jan 21, 2020

Related to #1070: it might be useful to be able to selectively re-run extractors, or to prevent a given extractor from being re-run when "Re-run extractors" is hit.

Use case: I'm using an external extractor to hit an API that asks a cloud platform to pay its workers on each successful classification. If I had another extractor for aggregation / retirement and needed to re-run it (for whatever reason), re-running everything would presumably spam my payment API as well?

@camallen (Contributor)

The whole idea of re-running extractors is to ensure we extract the data from the classification as per the extractor config, which in turn re-runs the downstream reducers.

I can see a need to avoid re-running reducers etc., but avoiding a re-run / call to the external service is not what this service was designed to do. The code could reflect on a workflow configuration option and skip re-running the extractor (or downstream reductions etc.), but it would be a very low priority for feature development on Caesar.
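
For illustration, a hypothetical sketch of what such a guard could look like. Nothing here exists in Caesar today: the `skip_on_rerun` configuration key and the object shapes are invented to show the idea of reflecting on a workflow configuration option before re-running an extractor.

```python
# Hypothetical guard applied when a "Re-run extractors" request comes in.
# `workflow.configuration` and `extractor.key` are illustrative names,
# not Caesar's actual schema.
def extractors_to_rerun(workflow, extractors):
    skip = set(workflow.configuration.get("skip_on_rerun", []))
    return [e for e in extractors if e.key not in skip]

def rerun_extractors(workflow, extractors, classification):
    for extractor in extractors_to_rerun(workflow, extractors):
        extractor.process(classification)  # illustrative extraction call
```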

@rogerhutchings (Author)

I figured that would be the case. It sounds like, even though it works, I'd be better off moving this out of Caesar, since what we're actually doing is generating classification side effects and breaking idempotence as a result.

Having another app consume the Kinesis stream might be a better idea, if we can work out cross-account access to it; I don't want to put stuff on Zoo infrastructure unnecessarily.
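
For reference, a minimal sketch of what a cross-account consumer could look like, assuming boto3. The role ARN, stream name, and `handle()` function are all hypothetical, and a production reader would want checkpointing (e.g. KCL) rather than polling a single shard like this.

```python
import time
import boto3

def handle(data: bytes) -> None:
    """Hypothetical per-classification handler."""
    print(data)

# Cross-account access: assume an IAM role in the account that owns the
# stream, then build a Kinesis client from the temporary credentials.
creds = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::111111111111:role/zoo-kinesis-reader",  # hypothetical
    RoleSessionName="classification-consumer",
)["Credentials"]

kinesis = boto3.client(
    "kinesis",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

shard_iterator = kinesis.get_shard_iterator(
    StreamName="zooniverse-classifications",  # hypothetical stream name
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

while shard_iterator:
    resp = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
    for record in resp["Records"]:
        handle(record["Data"])
    shard_iterator = resp["NextShardIterator"]
    time.sleep(1)  # stay under the per-shard read limits
```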

@camallen (Contributor)

Or an alternative setup where you use Caesar as your event publisher for classification data. You then build an API-gateway-style system backed by a Redis store (for atomic transactions, see https://redis.io/topics/transactions) and only propagate a classification event if you have no record of its classification id in your store.

This would save you building a coupled (Kinesis) stream reader and let you focus your idempotence logic on replayed events from the publisher (Caesar). Thoughts?
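
A minimal sketch of that gate, assuming the redis-py client. `send_downstream()` is a hypothetical stand-in for the payment API call, and `SET` with `NX` is used as a simpler atomic check-and-set than a full MULTI/EXEC transaction for this single-key case.

```python
import redis

r = redis.Redis()  # assumes a local Redis instance

def propagate_once(classification_id: str, payload: dict) -> bool:
    """Forward a classification event only the first time its id is seen."""
    # SET ... NX is atomic: it succeeds only if the key does not already
    # exist, so concurrent deliveries of the same replay can't both pass.
    first_time = r.set(f"seen:classification:{classification_id}", 1, nx=True)
    if not first_time:
        return False  # a replay from "Re-run extractors"; drop it
    send_downstream(payload)
    return True

def send_downstream(payload: dict) -> None:
    """Hypothetical stand-in for the call to the external payment API."""
    ...
```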

@camallen (Contributor)

As for Zoo infrastructure, short term this isn't an issue and I don't think the cost here is very high; longer term it can be revisited.
