bug fix: Fix fuzz testcase for cast string to integer #450

Closed
wants to merge 1 commit into from

Conversation

vaibhawvipul
Contributor

Which issue does this PR close?

Closes #431.

Rationale for this change

Remove leading whitespace: for some inputs, Spark's error messages include the input's leading whitespace while Comet's do not, which causes the fuzz test to fail. Trimming the whitespace from the input resolves the mismatch.

What changes are included in this PR?

Trim leading whitespace from the input DataFrame used by the fuzz test.

How are these changes tested?

Fuzz tests pass.

@vaibhawvipul vaibhawvipul changed the title bug: Fix fuzz testcase for cast string to integer bug fix: Fix fuzz testcase for cast string to integer May 19, 2024
Contributor

@kazuyukitanimura kazuyukitanimura left a comment

LGTM pending CI
The title should start with fix: or test:, I think

Comment on lines +540 to +545
castTest(
  gen
    .generateStrings(dataSize, numericPattern, 8)
    .toDF("a")
    .withColumn("a", functions.trim($"a")),
  DataTypes.IntegerType)
Member

I'd prefer to see us fix the compatibility issue rather than skip testing strings that have leading and/or trailing whitespace. I have been looking into this and was going to make some comments here, but it turned out to be a bit more involved than I thought it would be, so I will create a PR soon with my proposed fix.

Member

The main issue is that we are trimming the input before processing, so error messages use the trimmed input instead of the original input. Once I resolved that I saw that we had extra processing for leading and trailing whitespace that is never used (ported from Spark) so I ended up removing that for some performance wins.
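
To illustrate the mismatch described above, here is a minimal plain-Scala sketch. It is not Comet's or Spark's actual code: the object name, both helper functions, and the error message format are hypothetical and exist only to show how trimming before processing changes what the error message reports.

```scala
import scala.util.Try

// Illustrative sketch only; names and message format are assumptions,
// not taken from Comet or Spark.
object CastErrorMessageSketch {

  // Hypothetical helper: trims first, then reports the TRIMMED text on failure.
  def castReportingTrimmed(input: String): Either[String, Int] = {
    val trimmed = input.trim
    Try(trimmed.toInt).toOption.toRight(s"invalid input: '$trimmed'")
  }

  // Hypothetical helper: parses the trimmed text, but reports the ORIGINAL
  // text on failure, so surrounding whitespace is preserved in the message.
  def castReportingOriginal(input: String): Either[String, Int] =
    Try(input.trim.toInt).toOption.toRight(s"invalid input: '$input'")

  def main(args: Array[String]): Unit = {
    val raw = "  12x"
    println(castReportingTrimmed(raw))  // Left(invalid input: '12x')
    println(castReportingOriginal(raw)) // Left(invalid input: '  12x')
  }
}
```

Both helpers accept and reject the same inputs; they differ only in the text placed in the error message, which is the kind of difference a test comparing error messages would flag.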

Contributor

+1

Member

Here is my PR: #453

Contributor Author

+1

@vaibhawvipul vaibhawvipul deleted the issue-431 branch May 21, 2024 00:48
Development

Successfully merging this pull request may close these issues.

bug: CAST string to integer does not handle all invalid inputs
3 participants