
feat: Support BloomFilterMightContain expr #179

Merged — 7 commits from advancedxy:SupportBFMightContain into apache:main on Mar 14, 2024

Conversation

@advancedxy (Contributor, author):

Which issue does this PR close?

Closes #145

Rationale for this change

More expr coverage

What changes are included in this PR?

  1. Define the new expression in proto.
  2. Add SparkBitArray, SparkBloomFilter and BloomFilterMightContain support on the Rust side.
  3. Add glue code to transform the Spark plan into a proto message on the JVM side, and the proto message into a physical expression on the native side.
  4. Add tests on both the JVM and native sides.

Note: BloomFilterMightContain is only available in Spark 3.3+, so a separate test source directory is added to cover it.

How are these changes tested?

Added new tests.
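At its core, the native expression is a per-batch membership test with null-propagating semantics. A minimal sketch, where might_contain stands in for the PR's SparkBloomFilter::might_contain_long (names and signatures here are illustrative, not the PR's exact code):

    use std::sync::Arc;
    use arrow::array::{ArrayRef, BooleanArray, Int64Array};

    // `might_contain` stands in for SparkBloomFilter::might_contain_long.
    fn might_contain_batch(
        might_contain: impl Fn(i64) -> bool,
        values: &Int64Array,
    ) -> ArrayRef {
        // Null inputs yield null outputs, matching Spark's behavior.
        let result: BooleanArray = values
            .iter()
            .map(|v| v.map(&might_contain))
            .collect();
        Arc::new(result)
    }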

@advancedxy (Contributor, author):

cc @viirya and @sunchao.

Given my limited experience with Rust so far, there may be some non-idiomatic code in this PR. I'd appreciate your comments.

@codecov-commenter commented Mar 9, 2024:

Codecov Report

Attention: Patch coverage is 80.00000%, with 3 lines in your changes missing coverage. Please review.

Project coverage is 33.45%. Comparing base (d069713) to head (3851042).
Report is 1 commit behind head on main.

Files Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 85.71% 0 Missing and 2 partials ⚠️
...la/org/apache/comet/shims/ShimQueryPlanSerde.scala 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #179      +/-   ##
============================================
+ Coverage     33.31%   33.45%   +0.14%     
- Complexity      767      770       +3     
============================================
  Files           107      107              
  Lines         35375    35432      +57     
  Branches       7658     7696      +38     
============================================
+ Hits          11784    11855      +71     
+ Misses        21144    21108      -36     
- Partials       2447     2469      +22     


(3 resolved review threads on core/src/execution/datafusion/util/spark_bloom_filter.rs)

fn evaluate(&self, batch: &RecordBatch) -> DataFusionResult<ColumnarValue> {
// lazily get the spark bloom filter
if self.bloom_filter.get().is_none() {
Member: Just curious — is there any clear advantage to initializing this lazily?

@advancedxy (author): The bloom filter's binary comes from either a literal or a ScalarSubquery, so I took the approach of initializing it lazily.

But maybe we can simply evaluate it before constructing the expression, just as in_list is constructed; let me address that in a new commit.

@advancedxy (author): The static bloom filter is now evaluated eagerly. Please let me know what you think of the latest change.

(resolved review thread on core/src/execution/datafusion/util/spark_bit_array.rs)

pub fn set(&mut self, index: usize) -> bool {
if !self.get(index) {
self.data[index >> 6] |= 1u64 << (index & 0x3f);
Member: Why index & 0x3f? Spark's BitArray doesn't do this.

@advancedxy (author): Java and Rust have different semantics for shift-left.

In Java, the shift-left operator will rotate bits when the shift distance is larger than 64:

jshell> 1 << 65
$1 ==> 2

jshell> 1 << 129
$5 ==> 2

Rust doesn't support this semantic; it will panic on overflow:

1u64 << 65 // panics: attempt to shift left with overflow (in debug builds)

Member: Oh, it is not rotated. The Java shift operators are defined as follows:

If the promoted type of the left-hand operand is long, then only the six lowest-order bits of the right-hand operand are used as the shift distance. It is as if the right-hand operand were subjected to a bitwise logical AND operator & with the mask value 0x3f (0b111111).[11] The shift distance actually used is therefore always in the range 0 to 63, inclusive.

https://en.wikipedia.org/wiki/Bitwise_operation

Member: Maybe we should add a comment on this?

@advancedxy (author):

> Oh, it is not rotated.

Hmm, thanks for the correction and the info. I couldn't find an authoritative description of how Java defines its shift operators and thought it was a rotating shift.

> Maybe we should add a comment on this?

Of course.
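For reference, the masked-shift equivalence the spec describes is easy to check in Rust (an illustrative snippet, not code from this PR):

    fn main() {
        let index: usize = 129;
        // Java's `<<` on a long uses only the six lowest-order bits of the
        // shift distance (an implicit `& 0x3f`), so shifting by 129 behaves
        // like shifting by 1. Rust's `<<` panics on such an overflow in
        // debug builds, so the mask is written out explicitly.
        assert_eq!(1u64 << (index & 0x3f), 2);
        // wrapping_shl applies the same masking and never panics.
        assert_eq!(1u64.wrapping_shl(index as u32), 2);
    }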

@advancedxy (author): I think this PR is ready for another round of review. I will address this comment together with any other issues.

pom.xml (outdated diff):
@@ -494,6 +494,7 @@ under the License.
<spark.version>3.2.2</spark.version>
<spark.version.short>3.2</spark.version.short>
<parquet.version>1.12.0</parquet.version>
<additional.test.source>spark-3.2</additional.test.source>
@viirya (Member), Mar 13, 2024: Do you add a spark-3.2 test source?

@advancedxy (author): spark-3.2 is just a placeholder; there's no test code to be added for spark-3.2 yet.

I can add the empty spark-3.2 dir, though.

Member: I mean, must we add additional.test.source for 3.2?

@advancedxy (author): It's used in the plugin to configure the additional test source:

        <executions>
          <execution>
            <id>add-test-source</id>
            <phase>generate-test-sources</phase>
            <goals>
              <goal>add-test-source</goal>
            </goals>
            <configuration>
              <sources>
                <source>src/test/${additional.test.source}</source>
              </sources>
            </configuration>
          </execution>
        </executions>

If we don't add a placeholder here, the configuration will be wrong or will have to be configured conditionally.

@advancedxy force-pushed the SupportBFMightContain branch from de678d3 to 23bec73 on March 13, 2024 at 11:51
@advancedxy force-pushed the SupportBFMightContain branch from 23bec73 to e13c168 on March 13, 2024 at 11:55
@@ -494,6 +495,8 @@ under the License.
<spark.version>3.2.2</spark.version>
<spark.version.short>3.2</spark.version.short>
<parquet.version>1.12.0</parquet.version>
<!-- we don't add special test suits for spark-3.2, so a not existed dir is specified-->
@advancedxy (author): @viirya I added a comment here. Hopefully this addresses your concern.

The <additional.test.source> property is now added as a global property, which will make IDEs happy. It also simply works out of the box for ./mvnw commands without additional profiles.

@sunchao (Member): Looks mostly good, just a few more nits.

) -> Self {
// early evaluate the bloom_filter_expr to get the actual bloom filter
let bloom_filter = evaluate_bloom_filter(&bloom_filter_expr)
.expect("bloom_filter_expr could be evaluated statically");
Member: "could be" -> "could not be"? Also, we could consider returning Result and changing this to try_new, but it's not a big deal.

@advancedxy (author): Hmmm, I thought the expect message was a precondition message.

try_new seems better; let me change to that.

Member: In that case it's better to say "bloom_filter_expr should be evaluated successfully"?

@advancedxy (author), Mar 14, 2024: Yeah, "should" should be used.

Anyway, I changed it to try_new returning Result.
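Roughly, the shape under discussion looks like this (a sketch only: it assumes the struct stores bloom_filter_expr, value_expr and the evaluated bloom_filter, and that evaluate_bloom_filter returns a DataFusionResult; the real PR code may differ):

    impl BloomFilterMightContain {
        pub fn try_new(
            bloom_filter_expr: Arc<dyn PhysicalExpr>,
            value_expr: Arc<dyn PhysicalExpr>,
        ) -> DataFusionResult<Self> {
            // Evaluate the bloom filter expression eagerly, surfacing any
            // failure as an Err instead of panicking via expect().
            let bloom_filter = evaluate_bloom_filter(&bloom_filter_expr)?;
            Ok(Self {
                bloom_filter_expr,
                value_expr,
                bloom_filter,
            })
        }
    }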

#[derive(Debug, Hash)]
pub struct SparkBloomFilter {
bits: SparkBitArray,
num_hashes: u32,
Member: nit: better to add a comment on this, since it is actually the number of hash functions.

@advancedxy (author): I changed the variable to num_hash_functions, which should be clearer.
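After the rename, the struct reads roughly like this (a sketch of the updated shape, with the comment the review asked for):

    #[derive(Debug, Hash)]
    pub struct SparkBloomFilter {
        bits: SparkBitArray,
        // The number of hash functions applied per item, mirroring
        // Spark's BloomFilterImpl.numHashFunctions.
        num_hash_functions: u32,
    }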

(resolved review thread on core/src/execution/datafusion/util/spark_bloom_filter.rs)
}
}

pub fn put_long(&mut self, item: i64) -> bool {
Member: I think this is not used right now, but perhaps it will be in the future?

@advancedxy (author): Yeah, it's not used right now; it's added for symmetry and will most likely be used in the future.
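For a sense of what put_long does, here is a sketch following Spark's BloomFilterImpl.putLong double-hashing scheme; murmur3_hash_long (a seeded 32-bit Murmur3 over the i64) and bit_size are hypothetical helpers standing in for whatever this crate actually provides:

    pub fn put_long(&mut self, item: i64) -> bool {
        // Two seeded Murmur3 hashes drive the double hashing.
        let h1 = murmur3_hash_long(item, 0);
        let h2 = murmur3_hash_long(item, h1);
        let bit_size = self.bits.bit_size() as i32;
        let mut bit_changed = false;
        for i in 1..=(self.num_hash_functions as i32) {
            let mut combined_hash = h1.wrapping_add(i.wrapping_mul(h2));
            // Flip all bits if negative to guarantee a non-negative
            // index, as Spark does.
            if combined_hash < 0 {
                combined_hash = !combined_hash;
            }
            bit_changed |= self.bits.set((combined_hash % bit_size) as usize);
        }
        bit_changed
    }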

})
.unwrap_or_else(|| {
// when the bloom filter is null, we should return null for all the input
Ok(ColumnarValue::Scalar(ScalarValue::Boolean(None)))
@advancedxy (author): Rather than using ScalarValue::Null, I think ScalarValue::Boolean(None) is more appropriate, since it carries the data type info.

@viirya (Member), Mar 14, 2024: ScalarValue::Boolean(None) is correct. ScalarValue::Null is the null type.
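A quick check of the distinction (assuming current DataFusion naming, where ScalarValue exposes data_type(); import paths may vary by version):

    use arrow::datatypes::DataType;
    use datafusion::scalar::ScalarValue;

    fn main() {
        // Boolean(None) is a NULL that still carries the Boolean type,
        // while ScalarValue::Null has the dedicated Null data type.
        assert_eq!(ScalarValue::Boolean(None).data_type(), DataType::Boolean);
        assert_eq!(ScalarValue::Null.data_type(), DataType::Null);
    }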

@sunchao (Member): LGTM

@sunchao merged commit 969f683 into apache:main on Mar 14, 2024. 26 checks passed.
@sunchao (Member) commented Mar 14, 2024: Merged, thanks!


Linked issue: Support Spark bloom filter expression BloomFilterMightContain (#145)

5 participants