Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Will Comet support closed-source forks of Apache Spark (e.g. CSP versions)? #414

Open
andygrove opened this issue May 10, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@andygrove
Copy link
Member

What is the problem the feature request solves?

We have our first PR up that works around an issue with Comet working with AWS Spark (#412).

I think we need to carefully consider our stance on supporting closed-source forks of Spark from the cloud service providers.

Supporting closed-source Spark versions is challenging for many reasons:

  • There are often custom operators and expressions that Comet will not be aware of, and, therefore will not be able to map to native versions
  • Software updates can be pushed out at any time, potentially breaking compatibility with Comet
  • It is difficult to debug issues because source is not available (and reverse engineering is prohibited)
  • We would ideally need CI to test against all supported CSP versions
  • It increases the maintenance burden for all contributors, and not all contributors will have access to the CSP versions for testing

If the community desires to maintain Comet versions that can work with CSP Spark versions, then I think we would need to find an approach that allows those contributors to extend the "core" Comet project and add CSP support without adding maintenance burden for the core project.

One idea, for example, would be to keep the core datafusion-comet project compatible with OSS Apache Spark, and then have specific downstream repositories such as datafusion-comet-aws that extend the project to support a specific CSP.

Describe the potential solution

No response

Additional context

No response

@viirya
Copy link
Member

viirya commented May 10, 2024

Thanks @andygrove for creating this.

I think we don't claim that Comet supports for closed source forks of Spark right now. It would be impossible to make such claims as we don't have such resources to make sure it happens.

For #412, I think although it is proposed to support AWS Spark, but the patch actually can be seen as a prevention to take unexpected constructors which have different parameters. I think it makes sense to do.

@viirya
Copy link
Member

viirya commented May 10, 2024

There is also a reported compatibility issue with Databricks Spark: #190

@andygrove
Copy link
Member Author

I plan on creating a PR to update our documentation to make it clear that we only support Apache Spark and not other Spark implementations.

@parthchandra
Copy link
Contributor

I agree that we cannot support (i.e. guarantee compatibility) with proprietary forks, but I guess it is OK to accept PRs like #412 since it doesn't break anything and can only increase adoption.
If the volume of such PRs becomes too large we can consider a contrib directory or even another repo.

@eejbyfeldt
Copy link
Contributor

Based on the dicussion in #190 is sounds at bit like databricks might be shipping patches that are not by in the https://www.apache.org/foundation/marks/downstream.html (Or at least my reading of it).

Not that following it guarantees compatiability, but it sounds like it should make it less likely to have issues.

himadripal pushed a commit to himadripal/datafusion-comet that referenced this issue Sep 7, 2024
… regressions (apache#414)

* Set internal default configs

* add header

* comment out shuffle mode check

* revert shuffle mode default change

* revert enable shuffle by default

* update defaults
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants