Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Enable Comet shuffle with AQE coalesce partitions #834

Merged
merged 34 commits into from
Aug 17, 2024

Conversation

viirya
Copy link
Member

@viirya viirya commented Aug 16, 2024

Which issue does this PR close?

Closes #387.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@viirya
Copy link
Member Author

viirya commented Aug 16, 2024

This is a copy of #651 but removed some changes. I'd like to see if "os cannot spawn new native thread" error could be removed.

And I also want to check if core-1 can pass without these changes on ubuntu-24.04.

@viirya viirya marked this pull request as draft August 16, 2024 06:01
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 33.92%. Comparing base (380f03d) to head (0850404).

Files Patch % Lines
...org/apache/comet/CometSparkSessionExtensions.scala 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #834      +/-   ##
============================================
+ Coverage     33.83%   33.92%   +0.09%     
  Complexity      870      870              
============================================
  Files           112      112              
  Lines         42970    42909      -61     
  Branches       9466     9473       +7     
============================================
+ Hits          14538    14557      +19     
+ Misses        25446    25352      -94     
- Partials       2986     3000      +14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@viirya
Copy link
Member Author

viirya commented Aug 16, 2024

After changing to ubuntu 24.04, the negative ref count error like following disappears on Spark 3.4.3, although it still happens on Spark 4.0.0 and Spark 3.5.1.

Note that it only happens on spark-sql-sql/core-1, other Spark SQL tests are passed.

[info]   		Suppressed: java.lang.IllegalStateException: RefCnt has gone negative
[info]   			at org.apache.comet.shaded.arrow.util.Preconditions.checkState(Preconditions.java:465)
[info]   			at org.apache.comet.shaded.arrow.memory.BufferLedger.release(BufferLedger.java:131)
[info]   			at org.apache.comet.shaded.arrow.memory.BufferLedger.release(BufferLedger.java:105)
[info]   			at org.apache.comet.shaded.arrow.vector.BaseValueVector.releaseBuffer(BaseValueVector.java:114)
[info]   			at org.apache.comet.shaded.arrow.vector.BaseFixedWidthVector.clear(BaseFixedWidthVector.java:248)
[info]   			at org.apache.comet.shaded.arrow.vector.BaseFixedWidthVector.close(BaseFixedWidthVector.java:238)
[info]   			at org.apache.comet.vector.CometVector.close(CometVector.java:210)```

@viirya
Copy link
Member Author

viirya commented Aug 16, 2024

As it cannot be reproduced locally and on internal CI, I'm going to disable columnar shuffle in spark-sql-sql/core-1 for Spark 4.0.0 and Spark 3.5.1 temporarily. We can revisit this issue once we find a certain way to reproduce it.

@viirya
Copy link
Member Author

viirya commented Aug 16, 2024

For Spark 3.4.3, it is also probably negative ref count error happening on core-1 (e.g., in last run). So I disabled Comet shuffle for it too.

@viirya viirya marked this pull request as ready for review August 17, 2024 05:54
@viirya viirya requested a review from andygrove August 17, 2024 05:54
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you @viirya!

@viirya viirya merged commit 3cff826 into apache:main Aug 17, 2024
74 checks passed
@viirya
Copy link
Member Author

viirya commented Aug 17, 2024

Thanks @andygrove

@andygrove andygrove mentioned this pull request Aug 19, 2024
5 tasks
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* chore: Remove COMET_SHUFFLE_ENFORCE_MODE_ENABLED

* Update plan stability

* Fix

* Fix

* Fix

* Fix

* Fix

* Fix

* Fix

* Fix

* Fix

* Fix

* Fix

* Remove test

* Update

* Use same allocator

* test

* Add synchronized

* test

* Revert "test"

This reverts commit 5574bf5.

* Revert "Add synchronized"

This reverts commit aac200a.

* Fix

* fix

* Update diffs

* Update diffs

* Add CometColumnarBatch

* Change to ubuntu-20.04.

* Change to macos-latest

* Change to ubuntu-24.04

* update 3.4.3..diff

* Update to ubuntu-24.04 for Spark 4.0.0 pipeline

* Revert some changes

* Disable Comet shuffle for Spark SQL core-1 test on Spark 3.5 and 4.0.0

* Disable Comet shuffle for Spark SQL core-1 on Spark 3.4.3 too.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable Comet shuffle with AQE coalesce partitions
3 participants