Stop making microbatch batches with filters that will never have any rows #10826

Merged
merged 4 commits into from
Oct 8, 2024

Conversation

QMalcolm
Contributor

@QMalcolm QMalcolm commented Oct 4, 2024

Resolves #10824

Our logic previously created batches for each batch period where batch_start <= event_end_time. This was problematic when a batch_start equaled the event_end_time, because a batch would be produced with a filter like WHERE event_time >= '2024-01-01 00:00:00' AND event_time < '2024-01-01 00:00:00'. The two conditions in that filter logically exclude each other, meaning that 0 rows would always be selected. Thus we've changed the batch creation logic to be batch_start < event_end_time (as opposed to <=), which stops the bad batch filter from being a possibility.
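To make the boundary case concrete, here is a minimal sketch of the two comparisons. It is not dbt's implementation: `sketch_day_batches` and its `inclusive` flag are hypothetical names, and it only handles a `day` batch size.

```python
from datetime import datetime, timedelta


def sketch_day_batches(start, end, inclusive=False):
    """Hypothetical helper (not dbt's API): daily (batch_start, batch_end) windows."""
    step = timedelta(days=1)
    batches = []
    batch_start = start
    # inclusive=True mimics the old `<=` comparison, False the new `<`.
    while batch_start < end or (inclusive and batch_start == end):
        # Cap the last window at the requested end time.
        batches.append((batch_start, min(batch_start + step, end)))
        batch_start += step
    return batches


start = datetime(2023, 12, 31)
end = datetime(2024, 1, 1)

old = sketch_day_batches(start, end, inclusive=True)
new = sketch_day_batches(start, end)

# With `<=`, the final window starts and ends at the same instant,
# so `event_time >= X AND event_time < X` can never match a row.
print(old[-1])
print(len(old), len(new))
```

Under these assumptions the `<=` variant yields one extra window whose start equals its end, while the `<` variant stops at the last window that can actually contain rows.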

Checklist

  • I have read the contributing guide and understand what's expected of me.
  • I have run this code in development, and it appears to resolve the stated issue.
  • This PR includes tests, or tests are not required or relevant for this PR.
  • This PR has no interface changes (e.g., macros, CLI, logs, JSON artifacts, config files, adapter interface, etc.) or this PR has already received feedback and approval from Product or DX.
  • This PR includes type annotations for new and modified functions.

…rows

Our logic previously created batches for each batch period where
batch_start <= event_end_time. This was problematic when a batch_start
equaled the event_end_time, because a batch would be produced with a filter
like `WHERE event_time >= '2024-01-01 00:00:00' AND event_time < '2024-01-01 00:00:00'`.
The two conditions in that filter logically exclude each other, meaning that
0 rows would be selected _always_. Thus we've changed the batch creation logic
to be batch_start `<` event_end_time (as opposed to `<=`), which stops the
bad batch filter from being a possibility.
@QMalcolm QMalcolm added the Skip Changelog Skips GHA to check for changelog file label Oct 4, 2024
@cla-bot cla-bot bot added the cla:yes label Oct 4, 2024

codecov bot commented Oct 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.18%. Comparing base (6b9c1da) to head (4e3c4b4).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10826      +/-   ##
==========================================
- Coverage   89.20%   89.18%   -0.02%     
==========================================
  Files         183      183              
  Lines       23402    23419      +17     
==========================================
+ Hits        20875    20886      +11     
- Misses       2527     2533       +6     
Flag Coverage Δ
integration 86.40% <75.00%> (-0.11%) ⬇️
unit 62.11% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Components Coverage Δ
Unit Tests 62.11% <100.00%> (-0.01%) ⬇️
Integration Tests 86.40% <75.00%> (-0.11%) ⬇️

@QMalcolm
Contributor Author

QMalcolm commented Oct 4, 2024

One outstanding question is...

what should happen if you have a lookback=2, batch_size=day, an --event-time-end "2024-05-26", and don't specify an --event-time-start? Would you expect:

  1. 3 batches
    a. 2024-05-23 00:00:00 <= event_time < 2024-05-24 00:00:00
    b. 2024-05-24 00:00:00 <= event_time < 2024-05-25 00:00:00
    c. 2024-05-25 00:00:00 <= event_time < 2024-05-26 00:00:00

  2. 2 batches
    a. 2024-05-24 00:00:00 <= event_time < 2024-05-25 00:00:00
    b. 2024-05-25 00:00:00 <= event_time < 2024-05-26 00:00:00

With the change made in 30c9aea we are currently doing (2). If we want (1) to happen we'll need to make a change to the start time calculation here. Basically we'd do something like

if MicrobatchBuilder.truncate_timestamp(checkpoint, batch_size) == checkpoint:
  lookback += 1

start = MicrobatchBuilder.offset_timestamp(checkpoint, batch_size, -1 * lookback)

@QMalcolm QMalcolm removed the Skip Changelog Skips GHA to check for changelog file label Oct 4, 2024
@QMalcolm
Contributor Author

QMalcolm commented Oct 8, 2024

One outstanding question is...

what should happen if you have a lookback=2, batch_size=day, an --event-time-end "2024-05-26", and don't specify an --event-time-start?

Thinking about it more, I think we should go with option (1) that I laid out. That being three batches:

  1. 2024-05-23 00:00:00 <= event_time < 2024-05-24 00:00:00
  2. 2024-05-24 00:00:00 <= event_time < 2024-05-25 00:00:00
  3. 2024-05-25 00:00:00 <= event_time < 2024-05-26 00:00:00

The reason I think we should go in that direction is that I asked myself what I would expect to happen if an automated job kicked off that model at 2024-05-26 00:00:00. In that scenario I would expect it to run the previous day, 2024-05-25 00:00:00 <= event_time < 2024-05-26 00:00:00, and then the prior two days as well. If that is what we'd expect to happen in that scenario, then it is reasonable to extend that expectation to the case where one specifies that timestamp via --event-time-end.
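As a rough check on the two options, the sketch below counts daily windows for lookback=2 and an end of 2024-05-26 00:00:00. `day_windows` and `bump_on_boundary` are hypothetical names used only for illustration, and the function handles only a `day` batch size; it is not the code in this PR.

```python
from datetime import datetime, timedelta


def day_windows(end, lookback, bump_on_boundary):
    """Hypothetical helper: daily [start, end) windows leading up to `end`."""
    midnight = end.replace(hour=0, minute=0, second=0, microsecond=0)
    periods_back = lookback
    # Option (1): when `end` sits exactly on a day boundary, go back one more day.
    if bump_on_boundary and midnight == end:
        periods_back += 1
    start = midnight - timedelta(days=periods_back)
    windows = []
    while start < end:
        windows.append((start, min(start + timedelta(days=1), end)))
        start += timedelta(days=1)
    return windows


end = datetime(2024, 5, 26)
option_1 = day_windows(end, lookback=2, bump_on_boundary=True)
option_2 = day_windows(end, lookback=2, bump_on_boundary=False)

# A mid-day end time is not on a boundary, so the bump never applies to it.
midday = day_windows(datetime(2024, 5, 26, 12), lookback=2, bump_on_boundary=True)

print(len(option_1), len(option_2), len(midday))
```

In this sketch a run at noon on 2024-05-26 covers the partial current day plus two prior days, so bumping the boundary case to three windows keeps the batch count the same whether or not the end time falls exactly on midnight.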

… that equal the checkpoint

Previously if the checkpoint provided to `build_start_time` was at the truncation point
for the related batch size, then the returned `start_time` would be _the same_ as the checkpoint.
For example if the checkpoint was "2024-09-05 00:00:00", and the batch size was `day`, then the
returned `start_time` would be "2024-09-05 00:00:00". This is problematic because then there would
be no batch created when running `build_batches`. Or, prior to 12bb2ca, you'd get one batch with
a filter like `event_time >= 2024-09-05 00:00:00 AND event_time < 2024-09-05 00:00:00` which is
impossible to satisfy.

The change in this PR makes it so that if the checkpoint is at the truncation point, then the start
time will be guaranteed to move back by one batch period. That is, following the same example,
"2024-09-04 00:00:00" would be returned.
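The behavior described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the body of `build_start_time`: `truncate`, `sketch_start_time`, and the `STEP` mapping are hypothetical, only `hour` and `day` batch sizes are covered, and a lookback of 0 is assumed to match the example.

```python
from datetime import datetime, timedelta

# Assumed mapping for this sketch only; other batch sizes are left out.
STEP = {"hour": timedelta(hours=1), "day": timedelta(days=1)}


def truncate(ts, batch_size):
    """Hypothetical helper: round `ts` down to the start of its batch period."""
    if batch_size == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if batch_size == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported batch size in this sketch: {batch_size}")


def sketch_start_time(checkpoint, batch_size, lookback=0):
    """Hypothetical stand-in for the start time calculation."""
    truncated = truncate(checkpoint, batch_size)
    # A checkpoint already on the truncation point moves back one extra period,
    # so the start time can never equal the checkpoint.
    if truncated == checkpoint:
        lookback += 1
    return truncated - lookback * STEP[batch_size]


on_boundary = sketch_start_time(datetime(2024, 9, 5), "day")
mid_period = sketch_start_time(datetime(2024, 9, 5, 13, 30), "day")

print(on_boundary, mid_period)
```

With these assumptions, a checkpoint of "2024-09-05 00:00:00" produces a start of "2024-09-04 00:00:00", while a checkpoint later that day still starts at "2024-09-05 00:00:00".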
@QMalcolm QMalcolm marked this pull request as ready for review October 8, 2024 22:10
@QMalcolm QMalcolm requested a review from a team as a code owner October 8, 2024 22:10
@QMalcolm QMalcolm merged commit f6cdacc into main Oct 8, 2024
62 of 64 checks passed
@QMalcolm QMalcolm deleted the qmalcolm--10824-better-batching-with-event-time-end branch October 8, 2024 23:56

Successfully merging this pull request may close these issues.

Better batching when using --event-time-end
2 participants