Selectively re-run extractions feature #1071

Open · rogerhutchings opened this issue Jan 21, 2020 · 4 comments

@rogerhutchings commented Jan 21, 2020

Related to #1070: it might be useful to be able to selectively re-run extractors, or to prevent a given extractor from being re-run when "Re-run extractors" is hit.

Use case: I'm using an external extractor to hit an API that asks a cloud platform to pay its workers on each successful classification. If I had another extractor for aggregation / retirement and needed to re-run it (for whatever reason), re-running everything would presumably spam my payment API as well?

@camallen (Contributor)

The whole idea of re-running extractors is to ensure we extract the data from the classification as per the extractor config, which in turn re-runs the downstream reducers.

I can see a need to avoid re-running reducers etc., but avoiding a re-run / call to the external service is not what this service was designed to do. The code could reflect on a workflow configuration option and skip re-running the extractor (or downstream reductions etc.), but it would be a very low priority for feature development on Caesar.
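
For illustration, a hypothetical sketch of what such a guard could look like. Nothing here exists in Caesar today: the `skip_on_rerun` configuration key and the object shapes are invented to show the idea of reflecting on a workflow configuration option before re-running an extractor.

```python
# Hypothetical guard applied when a "Re-run extractors" request comes in.
# `workflow.configuration` and `extractor.key` are illustrative names,
# not Caesar's actual schema.
def extractors_to_rerun(workflow, extractors):
    skip = set(workflow.configuration.get("skip_on_rerun", []))
    return [e for e in extractors if e.key not in skip]

def rerun_extractors(workflow, extractors, classification):
    for extractor in extractors_to_rerun(workflow, extractors):
        extractor.process(classification)  # illustrative extraction call
```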

@rogerhutchings (Author)

I figured that would be the case. It sounds like, even though it works, I'd be better off moving this out of Caesar, since what we're actually doing is generating classification side effects and breaking idempotence as a result.

Having another app consume the Kinesis stream might be a better idea, if we can work out cross-account access to it; I don't want to put stuff on Zoo infrastructure unnecessarily.
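
For reference, a minimal sketch of what a cross-account consumer could look like, assuming boto3. The role ARN, stream name, and `handle()` function are all hypothetical, and a production reader would want checkpointing (e.g. KCL) rather than polling a single shard like this.

```python
import time
import boto3

def handle(data: bytes) -> None:
    """Hypothetical per-classification handler."""
    print(data)

# Cross-account access: assume an IAM role in the account that owns the
# stream, then build a Kinesis client from the temporary credentials.
creds = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::111111111111:role/zoo-kinesis-reader",  # hypothetical
    RoleSessionName="classification-consumer",
)["Credentials"]

kinesis = boto3.client(
    "kinesis",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

shard_iterator = kinesis.get_shard_iterator(
    StreamName="zooniverse-classifications",  # hypothetical stream name
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

while shard_iterator:
    resp = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
    for record in resp["Records"]:
        handle(record["Data"])
    shard_iterator = resp["NextShardIterator"]
    time.sleep(1)  # stay under the per-shard read limits
```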

@camallen (Contributor)

Or an alternative setup where you use Caesar as your event publisher for classification data. You then build an API-gateway-style system backed by a Redis store (for atomic transactions, see https://redis.io/topics/transactions) and only propagate a classification event if you have no record of its classification id in your store.

This would save you building a coupled (Kinesis) stream reader and let you focus your idempotence logic on replayed events from the publisher (Caesar). Thoughts?
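
A minimal sketch of that gate, assuming the redis-py client. `send_downstream()` is a hypothetical stand-in for the payment API call, and `SET` with `NX` is used as a simpler atomic check-and-set than a full MULTI/EXEC transaction for this single-key case.

```python
import redis

r = redis.Redis()  # assumes a local Redis instance

def propagate_once(classification_id: str, payload: dict) -> bool:
    """Forward a classification event only the first time its id is seen."""
    # SET ... NX is atomic: it succeeds only if the key does not already
    # exist, so concurrent deliveries of the same replay can't both pass.
    first_time = r.set(f"seen:classification:{classification_id}", 1, nx=True)
    if not first_time:
        return False  # a replay from "Re-run extractors"; drop it
    send_downstream(payload)
    return True

def send_downstream(payload: dict) -> None:
    """Hypothetical stand-in for the call to the external payment API."""
    ...
```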

@camallen (Contributor)

As for Zoo infrastructure, short term this isn't an issue and I don't think the cost here is very high; longer term it can be revisited.
