Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: Move random data generation methods from CometCastSuite to new DataGenerator class #426

Merged
merged 5 commits into from
May 15, 2024

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented May 14, 2024

Which issue does this PR close?

N/A

Rationale for this change

Start moving random data generation out of CometCastSuite and into a class that can be used from other test suites.

What changes are included in this PR?

  • Refactor data generator
  • Mark CAST string -> integer as incompatible because the data gen changes exposed new issues

How are these changes tested?

Existing tests

@@ -907,28 +866,6 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {
}
}

// TODO Commented out to work around scalafix since this is currently unused.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this commented out code since it is no longer needed

Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes looks good to me except one minor comment.

However, the CI is failing, you may need to take a look at that.

@@ -35,6 +35,9 @@ import org.apache.comet.expressions.{CometCast, Compatible}
class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {
import testImplicits._

/** Create a data generator using a fixed seed so that tests are reproducible */
private val gen = new DataGenerator(new Random(42))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to make this instance a field in the Companion object of RandomDataGenerator?

So that other test suites can all use the same random data generator by default.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure I understand what you are suggesting, I think you are saying to add this:

object DataGenerator {
  val DEFAULT = new DataGenerator(new Random(42))
}

And then reference that from the test suite?

private val gen = DataGenerator.DEFAULT

I think this could lead to some non-deterministic behavior. For example, running CometCastSuite on its own from my IDE would start with the freshly created generator, but running the same suite after after test suites have run would result in different inputs.

I think it would be better for each test to create a new instance of the generator with a fixed seed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we could define it as a def rather than a val to force a new creation each time. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the new suggested approach, that’s better and should be used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @advancedxy. I have implemented your suggestion.

@andygrove
Copy link
Member Author

However, the CI is failing, you may need to take a look at that.

Yes, this change effectivey changed the seed and it has exposed new compatibility issues in existing tests. I am investigating.

@andygrove andygrove requested a review from viirya May 14, 2024 17:01
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 34.18%. Comparing base (14494d3) to head (cf779df).
Report is 9 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #426      +/-   ##
============================================
+ Coverage     34.02%   34.18%   +0.16%     
+ Complexity      857      850       -7     
============================================
  Files           116      116              
  Lines         38565    38552      -13     
  Branches       8517     8523       +6     
============================================
+ Hits          13120    13178      +58     
+ Misses        22691    22609      -82     
- Partials       2754     2765      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -108,7 +108,7 @@ object CometCast {
Compatible()
case DataTypes.ByteType | DataTypes.ShortType | DataTypes.IntegerType |
DataTypes.LongType =>
Compatible()
Incompatible(Some("Not all invalid inputs are detected"))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed #431 to fix this

private val gen = DataGenerator.DEFAULT

/** Number of random data items to generate in each test */
private val dataSize = 10000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the size is increased 10x?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. At 1000 I saw string -> short fail, but not the other string -> integer tests. When I increased this to 10000 then all of the string -> integer tests failed, so seems like we get better coverage at this level

Double.NaN,
Double.PositiveInfinity,
Double.NegativeInfinity,
0.0d) ++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we cover

      1.0d,
       -1.0d,
       Int.MinValue.toDouble,
       Int.MaxValue.toDouble,
       0.0d,
       -0.0d

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. But I think it could be addressed in a follow-up pr.

Double.NaN,
Double.PositiveInfinity,
Double.NegativeInfinity,
0.0d) ++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. But I think it could be addressed in a follow-up pr.

@viirya viirya merged commit fcf7d5b into apache:main May 15, 2024
40 checks passed
@viirya
Copy link
Member

viirya commented May 15, 2024

Merged. Thanks @andygrove

@andygrove andygrove deleted the datagen branch May 15, 2024 19:01
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
… DataGenerator class (apache#426)

* add new DataGenerator class

* move more data gen methods

* ignore some failing tests, update compat docs

* address feedback

* fix regression

(cherry picked from commit fcf7d5b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants