WIP: experiment with SMJ last buffered batch #12082
Conversation
```rust
    )
    .run_test(&[JoinTestType::HjSmj], false)
    .await
```

```rust
for i in 0..1000 {
```
If I run this test 1000 times, there is a possibility that the test will fail.
```rust
// Try to determine whether the buffered batch we scan is the last one for the
// specific streamed row and join key.
// For batch_size == 1, self.buffered_data.scanning_finished() works well.
// For other scenarios this is an attempt to figure out that there are no more
// rows matching the same join key.
let last_batch = if self.batch_size == 1 {
```
@korowa @viirya @alamb
Hi folks, I'd appreciate any other ideas on how to determine the last batch.
AntiJoin relies on knowing the last batch in order to calculate the predicate for the join key correctly.
I'm trying to figure out that no more buffered rows are incoming for the given streamed join key. The approach is still not perfect, as it still allows the tests to fail from time to time, although it has become more stable.
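For reference, here is a minimal standalone sketch of the scan-state idea being discussed; the struct, its fields, and `scanning_finished()` are simplified stand-ins for illustration, not DataFusion's actual implementation:

```rust
// Hypothetical, simplified model of the buffered-side scan state; the names
// echo the PR discussion but this is NOT DataFusion's actual code.
struct BufferedData {
    batches: Vec<Vec<i32>>,    // joined buffered batches for the current streamed row
    scanning_batch_idx: usize, // index of the batch currently being scanned
}

impl BufferedData {
    // True once every joined buffered batch has been scanned.
    fn scanning_finished(&self) -> bool {
        self.scanning_batch_idx >= self.batches.len()
    }
}

fn main() {
    let data = BufferedData {
        batches: vec![vec![10, 10], vec![10]],
        scanning_batch_idx: 2,
    };
    // With batch_size == 1 the thread suggests scanning_finished() is a
    // reliable "last batch" signal; for larger batch sizes it is not.
    let batch_size = 1;
    let last_batch = batch_size == 1 && data.scanning_finished();
    assert!(last_batch);
}
```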
I will try and give it a look tomorrow
I would expect the right check, based on function names, to be

```rust
let last_batch = self.buffered_data.scanning_finished();
```

However, I tried that and the test still fails.
@richox I wonder if you have any ideas (as it appears you are the original author of SortMergeJoin in #2242).
I am having a hard time following the logic in such a large function (it looks like `freeze_streamed` is something like 300 lines long).
If I were debugging this issue more, what I would probably do is:
- break the logic down into a few more named functions so the logic boundaries and the intended actions are clearer.
- try to document, in comments, the intended invariants of BufferedBatch / ScanningBatch. My hope would be that in the process of writing that documentation I would learn the code better, and so have a clearer idea of which invariant isn't being upheld in this function.
Documentation would definitely help. By the way, here is a ticket for SMJ documentation: #10357
I'm starting to look at this PR (will take some time though)
> I'm starting to look at this PR (will take some time though)

Thanks, it's not an actual PR; it's more an attempt to explore directions for a solution and discuss them. I'm experimenting more in parallel and would love to hear your ideas as well.
I've finally got your idea (and the fact that the problem is not related to fetching the buffered side, but to processing already-joined buffered-side data). For anti join, the following will probably be helpful:
- `get_filtered_join_mask` -- maybe it should only update `matched_indices` (in case filters are evaluated as `true` at least once), and the data emission logic should live somewhere else (currently there is a problem that streamed records without any filter matches will be duplicated for each joined buffered chunk, as "negative" filter results are not tracked across joined batches). In any case, its output doesn't seem to be sufficient for anti join.
- filtered anti join should return only the records for which buffered-side scanning is completed (as `freeze_streamed` may be called in the middle of buffered-data scanning, due to output batch size) and for which there were no `true` filters (from p.1) -- so maybe we should split filter evaluation and output emission in `freeze_streamed` (since the filters should be checked for all matched indices, but at the same time the current streamed index can be filtered out of the output because it has further buffered batches to be joined with)?
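A minimal sketch of the first point, assuming a hypothetical per-streamed-row matched flag accumulated across joined buffered chunks (this models the idea, not `get_filtered_join_mask` itself):

```rust
// Hypothetical sketch: accumulate a per-streamed-row "matched" flag across
// joined buffered chunks, then emit anti-join output only for rows that
// never matched. This models the discussion, not DataFusion's actual code.
fn main() {
    // Filter results for 3 streamed rows, one mask per joined buffered chunk.
    let chunk_masks: Vec<Vec<bool>> = vec![
        vec![false, true, false],
        vec![false, false, false],
        vec![true, false, false],
    ];
    let mut matched = vec![false; 3];
    for mask in &chunk_masks {
        for (m, hit) in matched.iter_mut().zip(mask) {
            // A single true filter result in any chunk marks the row as matched.
            *m |= *hit;
        }
    }
    // Anti join keeps only streamed rows with no match in any chunk.
    let anti: Vec<usize> = matched
        .iter()
        .enumerate()
        .filter(|(_, m)| !**m)
        .map(|(i, _)| i)
        .collect();
    assert_eq!(anti, vec![2]);
}
```

The key point the sketch captures is that "negative" results must be carried across chunks, so emission can only happen once all chunks have been seen.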
Thanks @korowa for the directions. This week I will try to find out whether such an approach works for us; alternatively, I'm planning to play with the pair `scanning_batch().range.start` and `self.scanning_offset`. Perhaps it can give a hint for identifying the last joined buffered-side batch for the streaming row.
Another option just came to my mind. We know the number of matched rows in advance. We can calculate it as

```rust
let matched_indices: usize = self.buffered_data.batches.iter().map(|b| b.range.end - b.range.start).sum();
```

No matter how SMJ distributes the data, we can take `buffered_indices.len()` on each iteration and subtract it from `matched_indices`. Once we hit 0, no more matched rows are expected.
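A standalone sketch of this countdown idea, with hypothetical plain-Rust stand-ins for the batch ranges and per-iteration emissions:

```rust
// Hypothetical sketch of the countdown: total matched rows are known up
// front from the per-batch ranges; each iteration subtracts the rows it
// emitted, and reaching zero signals that no more matches remain.
use std::ops::Range;

fn main() {
    // Stand-in for the per-batch matched ranges from the buffered side.
    let ranges: Vec<Range<usize>> = vec![0..3, 0..2, 0..1];
    // Total matched rows = sum of per-batch range lengths.
    let mut remaining: usize = ranges.iter().map(|r| r.end - r.start).sum();
    assert_eq!(remaining, 6);
    // One possible chunking observed in the thread: 3 + 1 + 1 + 1.
    for emitted in [3usize, 1, 1, 1] {
        remaining -= emitted;
    }
    // Zero remaining means every matched buffered row has been processed.
    assert_eq!(remaining, 0);
}
```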
To create a reproducer test, you need to run a test in debug mode; the test creates a dump of the data locally on disk. For example, the test below is a reproducer case (just set the paths), whose step 1 outputs:
@korowa @viirya please help me understand the scenario with ranges. If there is a left streamed row with join key (1), from the right side we are going to have joined buffered batches where the range shows which indices share the same join key. For example, it
should have the ranges:
What I see now is that for some extreme case I can get joined buffered data, when being called, like:
I don't get the question clearly. You have:
Thanks @viirya, it's not indices, it's raw data. Let me rephrase. If I have a left table
and a right table,
and the join key is A and the filter is on column B: in batch 1, join_array is [10] with range 1..3, which is correct, as row numbers 1 and 2 relate to join key 10.
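To illustrate the intended range semantics, here is a standalone example; it assumes the buffered join array is sorted, and `key_range` is a hypothetical helper, not DataFusion code:

```rust
// Hypothetical helper illustrating the range semantics: for a sorted
// buffered-side join array, a key's range is the half-open span of indices
// holding that key. A model of the description above, not DataFusion code.
fn key_range(join_array: &[i32], key: i32) -> std::ops::Range<usize> {
    let start = join_array.partition_point(|&v| v < key);
    let end = join_array.partition_point(|&v| v <= key);
    start..end
}

fn main() {
    let join_array = [5, 10, 10, 20];
    // Rows 1 and 2 hold join key 10, so the range is 1..3.
    assert_eq!(key_range(&join_array, 10), 1..3);
}
```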
If you debug:
Would you let me know how you cut the 3 batches among the 6 buffered rows?
I believe it depends on batch_size and output_size. From what I have observed, the buffered batch of 6 rows can be processed differently: 3 + 1 + 1 + 1, or 1 + 1 + 1 + 1 + 1 + 1, or 1 batch of 6 rows. I think @korowa mentioned it here.
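The chunking behaviour could be modeled roughly like this (a hypothetical sketch: it only produces even splits bounded by batch size, and does not reproduce uneven splits like 3 + 1 + 1 + 1, which depend on how full the output batch already is):

```rust
// Hypothetical sketch: split a run of matched buffered rows into output
// chunks no larger than batch_size. Real SMJ output also depends on how
// much of the current output batch is already filled, which is not modeled.
fn chunk_sizes(total: usize, batch_size: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut left = total;
    while left > 0 {
        let take = left.min(batch_size);
        out.push(take);
        left -= take;
    }
    out
}

fn main() {
    // batch_size >= 6: one chunk of 6 rows.
    assert_eq!(chunk_sizes(6, 6), vec![6]);
    // batch_size == 1: six chunks of one row each.
    assert_eq!(chunk_sizes(6, 1), vec![1; 6]);
}
```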
For simplicity, let's consider the test in #12082 (comment). When I debug `freeze_streamed` I can see the buffered data coming in as:
What are the ranges here? The doc says:
but how?
That matches my understanding of these ranges in buffered batches.
@comphead, I've tried your example, and what I see while debugging is that there are 3 "versions" of the buffered data with the following ranges:
I'm able to see them before and after. At what point in the code are you able to observe this?
I'm running the test from #12082 (comment) and debugging. You can see there the buffered batch with the join array
which confuses me; I was thinking only buffered batches that contain the streaming key should be there, but it looks like that's not the case. I believe we can do the following:
@viirya @korowa do you think it would be enough to identify that all rows have been processed for the given join key?
@comphead I've finally got it -- it's like in this case SMJ is trying to produce output for each join-key pair (streamed-buffered) -- I guess that's how SMJ state management works now -- the streamed-side index won't move until all buffered-side data has been processed, since it's required to identify the current ordering.
I'd say that normally you don't need to compare join keys; you should rely on the scanning state instead. I also hope to start spending some time on SMJ due to #12359.
Thanks @korowa. I have been experimenting a lot with different parts of SMJ, and it showed that
And yes, I was also trying to compare join arrays, which could potentially give us a clue that everything has been processed, but it might be very expensive.
Which issue does this PR close?
Related to #11555
Closes #.
Rationale for this change
Experiment with an approach to identify the last buffered batch for the given streaming-row join key.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?