fix: Skip buffered rows which are not joined with streamed side when checking join filter results #12159

viirya · 2024-08-25T22:16:24Z

Which issue does this PR close?

Closes #.

Rationale for this change

Found a test failure in the Spark SQL test suite in apache/datafusion-comet#553 (https://github.com/apache/datafusion-comet/actions/runs/10498306015/job/29082983619?pr=553).

When checking join filter results for buffered rows, we should exclude rows that are not joined with the streamed side (i.e., null slots in buffered_indices). Otherwise, we may wrongly treat those rows as having passed the join filter, and the null-joined buffered rows will be missing from the output, as in the failing test.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

viirya · 2024-08-25T22:25:06Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

+                                // If the buffered row is not joined with streamed side,
+                                // skip it.
+                                if buffered_indices.is_null(i) {
+                                    continue;
+                                }
+


I tried to reproduce the Spark SQL test in DataFusion, but the failure depends on the data distribution, and I could not make it match the Spark case.

The test case looks like

select * from (
  with left as (
    select N, L
    from (select unnest(make_array(1, 2, 3, 4)) N, unnest(make_array('A', 'B', 'C', 'D')) L)
    where N <= 4
  ), right as (
    select N, L
    from (select unnest(make_array(1, 2, 3, 4)) N, unnest(make_array('A', 'B', 'C', 'D')) L)
    where N >= 3
  )
  select * from left full join right on left.N = right.N and left.N != 3
) order by 1, 2, 3, 4;

I want those values to be in the same partition so that a sort merge join runs. The joined batch then looks like:

1 A null null
2 B null null  # this and above row wrongly considered passing join filter for buffered row [3, C]
3 C 3 C        # join filter => false
4 D 4 D        # join filter => true

The buffered row [3, C] fails the join filter, so it should be output as a null-joined buffered row. However, the two rows above it are wrongly considered as passing the join filter for buffered index 0 (i.e., [3, C]), so it is not output in Comet.

The test case in Comet is apache/datafusion-comet#553 (comment)
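
To illustrate the effect described above, here is a minimal, self-contained sketch that uses plain Rust collections instead of Arrow arrays. It is not the code in sort_merge_join.rs: the function `filter_failed_map`, the `skip_nulls` flag, and the use of `Option<u64>` to model null slots are all hypothetical. It also assumes that, without the check, a null slot is read as buffered index 0 and that the filter mask is `true` for rows with no buffered match, which is what the example batch suggests.

```rust
use std::collections::HashMap;

/// For each buffered row index, records whether the join filter failed for
/// every output row that the buffered row was joined into.
///
/// `buffered_indices[i]` is `None` when output row `i` was not joined with
/// any buffered row (a null slot). `mask[i]` is the join filter result for
/// output row `i`. `skip_nulls` is a hypothetical switch between the
/// behavior before the fix (false) and after the fix (true).
fn filter_failed_map(
    buffered_indices: &[Option<u64>],
    mask: &[bool],
    skip_nulls: bool,
) -> HashMap<u64, bool> {
    let mut failed: HashMap<u64, bool> = HashMap::new();
    for (slot, &passed) in buffered_indices.iter().zip(mask) {
        let idx = match slot {
            Some(idx) => *idx,
            // After the fix: an unjoined output row says nothing about
            // any buffered row, so skip it.
            None if skip_nulls => continue,
            // Before the fix (assumption): the null slot is read as index 0.
            None => 0,
        };
        // A buffered row counts as "failed" only if no joined row passed.
        let entry = failed.entry(idx).or_insert(true);
        *entry = *entry && !passed;
    }
    failed
}

fn main() {
    // Output batch from the example: rows [1, A] and [2, B] have no
    // buffered match; [3, C] is buffered index 0; [4, D] is buffered index 1.
    let buffered_indices = vec![None, None, Some(0), Some(1)];
    // Assumption: the mask is true for the two unjoined rows.
    let mask = vec![true, true, false, true];

    let before = filter_failed_map(&buffered_indices, &mask, false);
    let after = filter_failed_map(&buffered_indices, &mask, true);

    println!("before fix: [3, C] failed filter = {:?}", before.get(&0));
    println!("after fix:  [3, C] failed filter = {:?}", after.get(&0));
}
```

Under these assumptions, the version without the null check marks [3, C] as not having failed the filter, so it would never be emitted as a null-joined buffered row, while the version with the check marks it as failed.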

comphead

lgtm @viirya. I feel it would be nice to have a DataFusion test for this, but as you said, it depends on the data distribution between record batches.

viirya · 2024-08-26T16:27:47Z

Thanks @comphead

viirya · 2024-08-26T20:57:11Z

Let me merge this to unblock apache/datafusion-comet#553.

fix: Skip buffered rows which are not joined with streamed side when …

1a31a69

…checking join filter results

github-actions bot added the physical-expr Physical Expressions label Aug 25, 2024

viirya requested a review from comphead August 26, 2024 02:59

viirya mentioned this pull request Aug 26, 2024

feat: Support sort merge join with a join condition apache/datafusion-comet#553

Merged

comphead approved these changes Aug 26, 2024

viirya merged commit dff590b into apache:main Aug 26, 2024
25 checks passed

viirya deleted the fix_null_indices branch August 26, 2024 20:57
