Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add xxhash64 function support #424

Merged
merged 4 commits into from
Jun 3, 2024

Conversation

advancedxy
Copy link
Contributor

Which issue does this PR close?

Part of #205
Closes #344

Rationale for this change

More function coverage

What changes are included in this PR?

  1. include twox-hash as a dep in rust
  2. implement xxhash64 related method in rust side
  3. glue code to bridge the jvm and rust

How are these changes tested?

New added test.

@andygrove
Copy link
Member

Thanks @advancedxy. I plan on reviewing this PR today.

Could you also update docs/source/user-guide/expressions.md to add xxhash64 as a supported expression?

@advancedxy
Copy link
Contributor Author

Could you also update docs/source/user-guide/expressions.md to add xxhash64 as a supported expression?

Of course, I will update that among other things: such as the review comments and the inspection file: spark/inspections/CometTPCDSQueriesList-results.txt in a followup commit.

@andygrove
Copy link
Member

I'd like to see the tests use some randomly generated inputs.

As a quick hack, I added the following test to CometCastSuite and it shows some differences in results between Spark and Comet.

  test("xxhash64") {
    val input = generateStrings(timestampPattern, 8).toDF("a")
    withTempPath { dir =>
      val data = roundtripParquet(input, dir).coalesce(1)
      data.createOrReplaceTempView("t")
      val df = spark.sql(s"select a, xxhash64(a) from t order by a")
      checkSparkAnswerAndOperator(df)
    }
  }

Some differences:

!== Correct Answer - 1000 ==           == Spark Answer - 1000 ==
 struct<a:string,xxhash64(a):bigint>   struct<a:string,xxhash64(a):bigint>
![,-7444071767201028348]               [,-1205034819632174695]
![	 23,-1992628079781282865]           [	 23,4312780814362028915]
![	31T3,5857608402363468958]           [	31T3,6089516869931970265]

We could extract the generate* methods from CometCastSuite into a separate DataGenerator class that other test suites can leverage.

@andygrove
Copy link
Member

Our hash implementation is also not compatible with Spark. I will file an issue for that.

@advancedxy
Copy link
Contributor Author

I'd like to see the tests use some randomly generated inputs.

As a quick hack, I added the following test to CometCastSuite and it shows some differences in results between Spark and Comet.

  test("xxhash64") {
    val input = generateStrings(timestampPattern, 8).toDF("a")
    withTempPath { dir =>
      val data = roundtripParquet(input, dir).coalesce(1)
      data.createOrReplaceTempView("t")
      val df = spark.sql(s"select a, xxhash64(a) from t order by a")
      checkSparkAnswerAndOperator(df)
    }
  }

Some differences:

!== Correct Answer - 1000 ==           == Spark Answer - 1000 ==
 struct<a:string,xxhash64(a):bigint>   struct<a:string,xxhash64(a):bigint>
![,-7444071767201028348]               [,-1205034819632174695]
![	 23,-1992628079781282865]           [	 23,4312780814362028915]
![	31T3,5857608402363468958]           [	31T3,6089516869931970265]

We could extract the generate* methods from CometCastSuite into a separate DataGenerator class that other test suites can leverage.

Good catch, and a good way to make sure the impl is correct. Let me check why the test is failing first.

@advancedxy
Copy link
Contributor Author

Let me check why the test is failing first.

Found the issue. The create_hashes_dictionary doesn't handle input hashes correctly, it affects both murmur3hash and this new xxhash64 method.

Let me try to fix that first.

@andygrove
Copy link
Member

See #426 for proposed DataGenerator class

@andygrove
Copy link
Member

Our hash implementation is also not compatible with Spark. I will file an issue for that.

I filed #427

@advancedxy
Copy link
Contributor Author

Our hash implementation is also not compatible with Spark. I will file an issue for that.

I filed #427

Thanks for filing this. I think it's the same issue for both murmur3 hash and xxhash64. I will submit a pr to fix that first.

@advancedxy
Copy link
Contributor Author

advancedxy commented May 14, 2024

Found the issue. The create_hashes_dictionary doesn't handle input hashes correctly, it affects both murmur3hash and this new xxhash64 method.

Let me try to fix that first.

I have submitted the fix in this PR and waiting for CI passes. I will create a separate PR to include the murmur3 hash fix and depends on your #426 in the morning (in Beijing time) first.

@advancedxy
Copy link
Contributor Author

@andygrove @viirya I have created #433 and mark this as a draft. We should merge that first and then come back to this PR . PLAL when you have tome.

@advancedxy advancedxy marked this pull request as ready for review May 28, 2024 01:14
@advancedxy
Copy link
Contributor Author

@andygrove @viirya @parthchandra and @sunchao would you mind to take a look at this? I think it's ready for review.

Comment on lines +685 to +702
let num_rows = args[0..args.len() - 1]
.iter()
.find_map(|arg| match arg {
ColumnarValue::Array(array) => Some(array.len()),
ColumnarValue::Scalar(_) => None,
})
.unwrap_or(1);
let mut hashes: Vec<u64> = vec![0_u64; num_rows];
hashes.fill(*seed as u64);
let arrays = args[0..args.len() - 1]
.iter()
.map(|arg| match arg {
ColumnarValue::Array(array) => array.clone(),
ColumnarValue::Scalar(scalar) => {
scalar.clone().to_array_of_size(num_rows).unwrap()
}
})
.collect::<Vec<ArrayRef>>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I feel this can be simplified a little bit

let arrays = args[0..args.len() - 1]
   ...;
let mut hashes: Vec<u64> = vec![0_u64; arrays.len()];
hashes.fill(*seed as u64);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. I think we have to compute num_rows first?

Comment on lines +294 to +296
DataType::Boolean => {
hash_array_boolean!(BooleanArray, col, i32, $hashes_buffer, $hash_method);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I wonder if we can make BooleanArray and i32 as macro argument, so that we can reduce this large case match...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, let me give it a try. I will report back if it's too hard to do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understands your proposal correctly, do you mean something like:

    match col.data_type() { 
        DataType::Int8 | DataType::Int16: | DataType::Int32 | DataType::Int64 | DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => {
            hash_array_primitive!(get_array_type_of!(col.data_type()), col, get_input_native_type_of!(col.data_type()), $hashes_buffer, $hash_method);
        }
        ....
    }

?

I tried to implement that, but couldn't find a way to do that. The col.data_type() is a runtime value, I don't we can infer it in the compile-time.

@advancedxy
Copy link
Contributor Author

Gently ping @andygrove @viirya, do you have any more comments?

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me. Thank you @advancedxy

@andygrove andygrove merged commit c79bd5c into apache:main Jun 3, 2024
43 checks passed
@advancedxy
Copy link
Contributor Author

advancedxy commented Jun 4, 2024

Thanks all for reviewing, @andygrove @viirya @kazuyukitanimura @parthchandra

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* feat: Add xxhash64 function support

* Update related docs

* Update core/src/execution/datafusion/spark_hash.rs

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* Update QueriesList results

---------

Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Parth Chandra <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

XxHash64 hash function support
5 participants