Stop making microbatch batches with filters that will never have any rows #10826
Conversation
…rows

Our logic previously created batches for each batch period where `batch_start <= event_end_time`. This was problematic when a batch_start equaled the event_end_time, because a batch would be produced with a filter like `WHERE event_time >= '2024-01-01 00:00:00' AND event_time < '2024-01-01 00:00:00'`. The two conditions in that filter logically exclude each other, meaning that 0 rows would _always_ be selected. Thus we've changed the batch creation logic to `batch_start < event_end_time` (as opposed to `<=`), which stops the bad batch filter from being a possibility.
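A minimal sketch of the idea, using a hypothetical `build_batches` helper (not dbt's actual implementation) to show why the strict `<` comparison prevents the zero-width batch:

```python
from datetime import datetime, timedelta

def build_batches(start, end, batch_delta):
    """Yield (batch_start, batch_end) pairs covering [start, end).

    Hypothetical simplification of the builder logic: looping while
    `batch_start < end` (rather than `<=`) means no batch is ever
    emitted whose start equals the overall end time, so the impossible
    filter `event_time >= X AND event_time < X` can't be generated.
    """
    batches = []
    batch_start = start
    while batch_start < end:  # the fix: strictly less-than
        batch_end = min(batch_start + batch_delta, end)
        batches.append((batch_start, batch_end))
        batch_start = batch_end
    return batches

day = timedelta(days=1)
bs = build_batches(datetime(2024, 1, 1), datetime(2024, 1, 3), day)
# Two batches: Jan 1 -> Jan 2 and Jan 2 -> Jan 3; no zero-width
# batch starting at Jan 3 is produced.
```

With the old `<=` comparison, a third iteration would have run with `batch_start == end`, producing exactly the self-excluding filter described above.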
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main   #10826      +/-   ##
==========================================
- Coverage   89.20%   89.18%   -0.02%
==========================================
  Files         183      183
  Lines       23402    23419      +17
==========================================
+ Hits        20875    20886      +11
- Misses       2527     2533       +6
==========================================
```

Flags with carried forward coverage won't be shown.
One outstanding question is... what should happen if you have a

With the change made in 30c9aea we are currently doing (2). If we want (1) to happen we'll need to make a change to the start time calculation here. Basically we'd do something like:

```python
if MicrobatchBuilder.truncate_timestamp(checkpoint, batch_size) == checkpoint:
    lookback += 1
start = MicrobatchBuilder.offset_timestamp(checkpoint, batch_size, -1 * lookback)
```
Thinking about it more, I think we should go with option (1) that I laid out. That being three batches:

The reason I think we should go in that direction is that I asked myself what I would expect to happen if an automated job kicked off that model
…kpoint straddles the batch size
… that equal the checkpoint

Previously, if the checkpoint provided to `build_start_time` was at the truncation point for the related batch size, then the returned `start_time` would be _the same_ as the checkpoint. For example, if the checkpoint was "2024-09-05 00:00:00" and the batch size was `day`, then the returned `start_time` would be "2024-09-05 00:00:00". This is problematic because then there would be no batch created when running `build_batches`. Or, prior to 12bb2ca, you'd get one batch with a filter like `event_time >= 2024-09-05 00:00:00 AND event_time < 2024-09-05 00:00:00`, which is impossible to satisfy. The change in this PR makes it so that if the checkpoint is at the truncation point, then the start time is guaranteed to move back by one batch period. That is, following the same example, "2024-09-04 00:00:00" would be returned.
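The boundary behavior described above can be sketched with plain `datetime` arithmetic. This is a hypothetical day-granularity paraphrase, not dbt's actual `build_start_time`:

```python
from datetime import datetime, timedelta

def build_start_time(checkpoint, lookback=1):
    """Hypothetical day-granularity version of the start-time logic.

    If the checkpoint sits exactly on a day boundary, truncating it
    returns the checkpoint itself, which would yield zero batches; in
    that case we step back one full extra period instead.
    """
    truncated = checkpoint.replace(hour=0, minute=0, second=0, microsecond=0)
    if truncated == checkpoint:
        # Checkpoint is at the truncation point: move back a full period.
        return checkpoint - timedelta(days=lookback)
    return truncated - timedelta(days=lookback - 1)

# Checkpoint exactly on the boundary: start moves back a full day.
on_boundary = build_start_time(datetime(2024, 9, 5))       # 2024-09-04 00:00:00
# Checkpoint mid-day: start is simply the truncated day.
mid_day = build_start_time(datetime(2024, 9, 5, 12, 30))   # 2024-09-05 00:00:00
```

This mirrors the PR's example: a "2024-09-05 00:00:00" checkpoint with a `day` batch size now produces a "2024-09-04 00:00:00" start, guaranteeing at least one non-empty batch window.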
Resolves #10824
Our logic previously created batches for each batch period where `batch_start <= event_end_time`. This was problematic when a batch_start equaled the event_end_time, because a batch would be produced with a filter like `WHERE event_time >= '2024-01-01 00:00:00' AND event_time < '2024-01-01 00:00:00'`. The two conditions in that filter logically exclude each other, meaning that 0 rows would always be selected. Thus we've changed the batch creation logic to `batch_start < event_end_time` (as opposed to `<=`), which stops the bad batch filter from being a possibility.

Checklist