feat: Add xxhash64 function support #424
27112b6 to dd9738d
Thanks @advancedxy. I plan on reviewing this PR today. Could you also update
Of course, I will update that among other things: such as the review comments and the inspection file:
I'd like to see the tests use some randomly generated inputs. As a quick hack, I added the following test to
test("xxhash64") {
  val input = generateStrings(timestampPattern, 8).toDF("a")
  withTempPath { dir =>
    val data = roundtripParquet(input, dir).coalesce(1)
    data.createOrReplaceTempView("t")
    val df = spark.sql(s"select a, xxhash64(a) from t order by a")
    checkSparkAnswerAndOperator(df)
  }
}
Some differences:
We could extract the
Our
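The idea of checking an implementation against a reference on randomly generated inputs can be sketched in Rust as well. This is only an illustration and not the project's test harness: `XorShift64`, `random_strings`, and `first_mismatch` are hypothetical names, and the generator is a toy one chosen so that the sketch needs no external crates.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
/// (Hypothetical helper, not part of the PR.)
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Generate `count` pseudo-random lowercase ASCII strings whose lengths
/// fall in 0..=max_len. The same seed always yields the same strings.
fn random_strings(seed: u64, count: usize, max_len: usize) -> Vec<String> {
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    (0..count)
        .map(|_| {
            let len = (rng.next() % (max_len as u64 + 1)) as usize;
            (0..len)
                .map(|_| (b'a' + (rng.next() % 26) as u8) as char)
                .collect::<String>()
        })
        .collect()
}

/// Return the first input on which the two implementations disagree, if any.
fn first_mismatch<'a>(
    inputs: &'a [String],
    reference: impl Fn(&str) -> u64,
    candidate: impl Fn(&str) -> u64,
) -> Option<&'a str> {
    inputs
        .iter()
        .map(String::as_str)
        .find(|&s| reference(s) != candidate(s))
}
```

A fixed seed keeps failures reproducible, and returning the offending input rather than a boolean makes a mismatch easier to debug.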
Good catch, and a good way to make sure the impl is correct. Let me check why the test is failing first.
Found the issue. The Let me try to fix that first.
See #426 for proposed DataGenerator class
I filed #427
Thanks for filing this. I think it's the same issue for both murmur3 hash and xxhash64. I will submit a PR to fix that first.
I have submitted the fix in this PR and am waiting for CI to pass. In the morning (Beijing time), I will create a separate PR that includes the murmur3 hash fix and depends on your #426.
@andygrove @viirya I have created #433 and marked this as a draft. We should merge that first and then come back to this PR. PTAL when you have time.
b6e42c3 to ebb3675
@andygrove @viirya @parthchandra and @sunchao would you mind taking a look at this? I think it's ready for review.
Co-authored-by: Liang-Chi Hsieh <[email protected]>
let num_rows = args[0..args.len() - 1]
    .iter()
    .find_map(|arg| match arg {
        ColumnarValue::Array(array) => Some(array.len()),
        ColumnarValue::Scalar(_) => None,
    })
    .unwrap_or(1);
let mut hashes: Vec<u64> = vec![0_u64; num_rows];
hashes.fill(*seed as u64);
let arrays = args[0..args.len() - 1]
    .iter()
    .map(|arg| match arg {
        ColumnarValue::Array(array) => array.clone(),
        ColumnarValue::Scalar(scalar) => {
            scalar.clone().to_array_of_size(num_rows).unwrap()
        }
    })
    .collect::<Vec<ArrayRef>>();
nit: I feel this can be simplified a little bit
let arrays = args[0..args.len() - 1]
...;
let mut hashes: Vec<u64> = vec![0_u64; arrays.len()];
hashes.fill(*seed as u64);
hmm. I think we have to compute num_rows first?
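The point of this exchange can be illustrated with a small standalone sketch: the number of argument columns and the number of rows are different quantities, and scalars can only be expanded once the row count is known. `Value` below is a hypothetical stand-in for DataFusion's `ColumnarValue`, reduced to `i64` data so the sketch compiles without external crates. Unlike the PR code, the seed is passed separately instead of as the last argument.

```rust
// Hypothetical stand-in for DataFusion's ColumnarValue, reduced to i64 data
// so the sketch compiles without any external crates.
#[derive(Clone, Debug)]
enum Value {
    Array(Vec<i64>),
    Scalar(i64),
}

/// Row count of the batch: the length of the first array argument,
/// or 1 when every argument is a scalar.
fn num_rows(args: &[Value]) -> usize {
    args.iter()
        .find_map(|arg| match arg {
            Value::Array(values) => Some(values.len()),
            Value::Scalar(_) => None,
        })
        .unwrap_or(1)
}

/// Expand every argument to a full column of `rows` values.
/// The result has one entry per argument, so its length is the
/// column count, not the row count.
fn to_columns(args: &[Value], rows: usize) -> Vec<Vec<i64>> {
    args.iter()
        .map(|arg| match arg {
            Value::Array(values) => values.clone(),
            Value::Scalar(value) => vec![*value; rows],
        })
        .collect()
}

/// One seeded hash slot per row (not per column).
fn init_hashes(args: &[Value], seed: u64) -> Vec<u64> {
    vec![seed; num_rows(args)]
}
```

With a scalar and a three-element array as arguments, `to_columns` returns two columns while `init_hashes` returns three slots. In this sketch `vec![seed; n]` creates the seeded buffer in one step, which is one possible way to shorten the allocate-then-`fill` pair.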
DataType::Boolean => {
    hash_array_boolean!(BooleanArray, col, i32, $hashes_buffer, $hash_method);
}
nit: I wonder if we can make BooleanArray and i32 macro arguments, so that we can reduce this large case match...
hmm, let me give it a try. I will report back if it's too hard to do that.
If I understand your proposal correctly, do you mean something like:
match col.data_type() {
    DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 | DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => {
        hash_array_primitive!(get_array_type_of!(col.data_type()), col, get_input_native_type_of!(col.data_type()), $hashes_buffer, $hash_method);
    }
    ....
}
?
I tried to implement that, but couldn't find a way to do it. The col.data_type() is a runtime value; I don't think we can infer it at compile time.
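A minimal sketch of why the match stays one arm per type: macros are expanded at compile time, so a type passed as a macro argument has to be written out as a token at each call site, while the data type tag is only known when the code runs. Everything here (`DataType`, `Int32Array`, `Int64Array`, `Column`, `sum_column!`) is a hypothetical, trimmed-down stand-in for the Arrow types and the PR's hashing macros, and summing replaces hashing to keep the example short.

```rust
use std::any::Any;

// Hypothetical, trimmed-down stand-ins for Arrow's DataType and array types.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DataType {
    Int32,
    Int64,
}

struct Int32Array(Vec<i32>);
struct Int64Array(Vec<i64>);

struct Column {
    data_type: DataType,
    values: Box<dyn Any>,
}

// The array type is a macro argument, so it must be spelled out as a token
// at each call site; it cannot be computed from a runtime value.
macro_rules! sum_column {
    ($array_type:ty, $column:expr) => {{
        let array = $column
            .values
            .downcast_ref::<$array_type>()
            .expect("data_type tag does not match the stored array");
        array.0.iter().map(|v| *v as i64).sum::<i64>()
    }};
}

fn sum(column: &Column) -> i64 {
    // The runtime tag selects an arm; each arm names its type statically.
    match column.data_type {
        DataType::Int32 => sum_column!(Int32Array, column),
        DataType::Int64 => sum_column!(Int64Array, column),
    }
}
```

Merging the arms with `|` would require a single macro invocation to name both array types at once, which is the obstacle described in the comment above.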
Gentle ping @andygrove @viirya, do you have any more comments?
This looks great to me. Thank you @advancedxy
Thanks all for reviewing, @andygrove @viirya @kazuyukitanimura @parthchandra
* feat: Add xxhash64 function support
* Update related docs
* Update core/src/execution/datafusion/spark_hash.rs
  Co-authored-by: Liang-Chi Hsieh <[email protected]>
* Update QueriesList results
---------
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Parth Chandra <[email protected]>
Which issue does this PR close?
Part of #205
Closes #344
Rationale for this change
More function coverage
What changes are included in this PR?
How are these changes tested?
Newly added test.