Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix the incorrect null joined rows for SMJ outer join with join filter #10892

Merged
merged 7 commits into from
Jun 18, 2024

Conversation

viirya
Copy link
Member

@viirya viirya commented Jun 12, 2024

Which issue does this PR close?

Closes #10882.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jun 12, 2024
@@ -860,6 +867,8 @@ impl SMJStream {
self.reservation.shrink(buffered_batch.size_estimation);
}
} else {
// If the head batch is not fully processed, break the loop.
// Streamed batch will be joined with the head batch in the next step.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this issue. But as I re-read these code, I add some comments to help us understand it again.

@alamb
Copy link
Contributor

alamb commented Jun 12, 2024

Possibly related to #10881

@viirya
Copy link
Member Author

viirya commented Jun 12, 2024

Possibly related to #10881

Hmm, I don't see smj is explicitly enabled there? So I guess it is still using hash join?

@viirya viirya changed the title fix: Fix the incorrect null joined rows for outer join with join filter fix: Fix the incorrect null joined rows for SMJ outer join with join filter Jun 12, 2024
@@ -84,7 +84,6 @@ SELECT * FROM t1 LEFT JOIN t2 ON t1.a = t2.a AND t2.b * 50 <= t1.b
Alice 100 Alice 1
Alice 100 Alice 2
Alice 50 Alice 1
Alice 50 NULL NULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean tests were not correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the results were incorrect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, as Alice 50 is joined with Alice 1, it should not be joined with nulls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, just checked in DuckDB

D CREATE TABLE t1(a text, b int); insert into t1  VALUES ('Alice', 50), ('Alice', 100), ('Bob', 1);
D CREATE TABLE t2(a text, b int);
D insert into t2 VALUES ('Alice', 2), ('Alice', 1);
D SELECT * FROM t1 LEFT JOIN t2 ON t1.a = t2.a AND t2.b * 50 <= t1.b
>
> ;
┌─────────┬───────┬─────────┬───────┐
│    a    │   b   │    a    │   b   │
│ varchar │ int32 │ varchar │ int32 │
├─────────┼───────┼─────────┼───────┤
│ Alice   │    50 │ Alice   │     1 │
│ Alice   │   100 │ Alice   │     1 │
│ Alice   │   100 │ Alice   │     2 │
│ Bob     │     1 │         │       │
└─────────┴───────┴─────────┴───────┘

@github-actions github-actions bot removed the core Core DataFusion crate label Jun 13, 2024
@@ -1445,9 +1504,13 @@ fn get_buffered_columns(
/// `streamed_indices` have the same length as `mask`
/// `matched_indices` array of streaming indices that already has a join filter match
/// `scanning_buffered_offset` current buffered offset across batches
///
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

streamed_columns.extend(buffered_columns);
let columns = streamed_columns;

Ok(Some(RecordBatch::try_new(schema.clone(), columns)?))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do just

Ok(Some(RecordBatch::try_new(schema.clone(), streamed_columns.extend(buffered_columns))?))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, it is actually

Ok(Some(RecordBatch::try_new(
    schema.clone(),
    streamed_columns,
)?))

let buffered_indices = UInt64Array::from_iter_values(
buffered_batch.join_filter_failed_idxs.iter().copied(),
);
if let Some(record_batch) = produce_buffered_null_batch(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may want to have a small helper function to reduce the duplicated call?

Copy link
Member Author

@viirya viirya Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already pull common one to produce_buffered_null_batch. It is duplicated as the producing logic is same.

You mean to inline produce_buffered_null_batch code in the for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking to have a small helper function to wrap

          if let Some(record_batch) = produce_buffered_null_batch(
                 &self.schema,
                 &self.streamed_schema,
                 &buffered_indices,
                 buffered_batch,
             )? {
                 self.output_record_batches.push(record_batch);
             }

I just realized it is a formatting, the block has 8 lines but in fact its 1 condition and 1 instruction, so lets leave it as is.

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya 💪

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @viirya I think its good to go

@viirya
Copy link
Member Author

viirya commented Jun 17, 2024

Thank you @alamb @comphead. Let me resolve the conflicts before merging.

@alamb alamb merged commit b26c1b8 into apache:main Jun 18, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 18, 2024

Thanks @viirya and @comphead

@viirya viirya deleted the fix_outer_join branch June 18, 2024 14:30
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…filter (apache#10892)

* fix: Fix the incorrect null joined rows for outer join with join filter

* Update datafusion/physical-plan/src/joins/sort_merge_join.rs

Co-authored-by: Oleks V <[email protected]>

* Update datafusion/physical-plan/src/joins/sort_merge_join.rs

Co-authored-by: Oleks V <[email protected]>

* Update datafusion/physical-plan/src/joins/sort_merge_join.rs

Co-authored-by: Oleks V <[email protected]>

* Update datafusion/physical-plan/src/joins/sort_merge_join.rs

Co-authored-by: Oleks V <[email protected]>

* For review

---------

Co-authored-by: Oleks V <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SMJ: incorrect result for filtered right outer join
3 participants