
Parameterize GoogleCloudStorage provider in GcsUtil to unblock gcs-co… #33368

Open. Wants to merge 1 commit into master.

clairemcginty
Contributor

@clairemcginty clairemcginty commented Dec 12, 2024

Rationale:

I would like to use gcs-connector 3.x, which supports the new Parquet VectorIO feature. However, gcs-connector 3.x also drops Java 8 and targets Java 11, which blocks us from upgrading it directly in Beam, since Beam still targets Java 8 (see #31678).

Additionally, as a Beam user, I can't just upgrade gcs-connector on my end, due to breaking changes in how GoogleCloudStorageImpl is instantiated: in 2.x it has public constructors, but in 3.x it drops the public constructors and enforces a Builder pattern.

Therefore, when running on gcs-connector 3.x, my pipeline throws a NoSuchMethodError from org.apache.beam.sdk.extensions.gcp.util.GcsUtil when it tries to invoke the 2.x constructor: https://github.com/apache/beam/blob/v2.61.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/GcsUtil.java#L727
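
For context, a NoSuchMethodError is a linkage error: the call site was compiled against a constructor that the class on the runtime classpath no longer declares. The sketch below is only an illustration with hypothetical stand-in classes (`ConstructorStyleClient`, `BuilderStyleClient`), not the real gcs-connector types. It uses a reflective lookup, which fails with the checked NoSuchMethodException for the same underlying reason.

```java
// Hypothetical stand-ins; these are NOT the real gcs-connector classes.
class ConstructorStyleClient { // models the 2.x shape: public constructor
  public ConstructorStyleClient(String options) {}
}

class BuilderStyleClient { // models the 3.x shape: no public constructor, builder only
  private BuilderStyleClient(String options) {}

  static Builder builder() {
    return new Builder();
  }

  static final class Builder {
    private String options = "";

    Builder setOptions(String options) {
      this.options = options;
      return this;
    }

    BuilderStyleClient build() {
      return new BuilderStyleClient(options);
    }
  }
}

class ConstructorProbe {
  // Returns true if `type` declares a public constructor with exactly these parameter types.
  static boolean hasPublicConstructor(Class<?> type, Class<?>... parameterTypes) {
    try {
      type.getConstructor(parameterTypes);
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(hasPublicConstructor(ConstructorStyleClient.class, String.class));
    System.out.println(hasPublicConstructor(BuilderStyleClient.class, String.class));
  }
}
```

Code compiled against the first shape and run against the second has no constructor to link to, which matches the failure described above.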

This PR adds a pipeline option for a GoogleCloudStorage Provider, so that users who want to use gcs-connector 3.x can be unblocked from doing so. It defaults to invoking the gcs-connector 2.x public constructor, but 3.x users can override it to use the Builder.
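
The shape of the idea, as a minimal sketch: the names below (`FakeStorageClient`, `FakeStorageClientProvider`, `ProviderSketch`) are hypothetical stand-ins, and a single String replaces the four real parameters. It does not reproduce the actual GcsOptions or GcsUtil code; it only shows a default provider that can be overridden.

```java
// Stand-in for the GoogleCloudStorage interface; hypothetical, for illustration only.
interface FakeStorageClient {
  String describe();
}

// Stand-in for the provider interface. The provider in this PR takes four parameters;
// a single String is used here to keep the sketch self-contained.
interface FakeStorageClientProvider {
  FakeStorageClient get(String options);
}

class ProviderSketch {
  // Default provider: models "invoke the 2.x public constructor".
  static final FakeStorageClientProvider DEFAULT =
      options -> () -> "constructor(" + options + ")";

  // User-supplied override: models "use the 3.x Builder".
  static final FakeStorageClientProvider BUILDER_BASED =
      options -> () -> "builder(" + options + ")";

  // Models the call site: use the configured provider, or the default when none is set.
  static FakeStorageClient create(FakeStorageClientProvider override, String options) {
    FakeStorageClientProvider provider = (override != null) ? override : DEFAULT;
    return provider.get(options);
  }

  public static void main(String[] args) {
    System.out.println(create(null, "opts").describe()); // prints "constructor(opts)"
    System.out.println(create(BUILDER_BASED, "opts").describe()); // prints "builder(opts)"
  }
}
```

Because the call site only depends on the provider interface, it never has to link against either the constructor or the Builder directly.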



Comment on lines +228 to +232
GoogleCloudStorage get(
GoogleCloudStorageOptions options,
Storage storage,
Credentials credentials,
HttpRequestInitializer httpRequestInitializer);
Contributor Author


These 4 params should cover both the 2.x constructor and the 3.x Builder

Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`

@clairemcginty
Contributor Author

cc @Abacn - wdyt of this workaround for gcs-connector 3.x?

@Abacn
Contributor

Abacn commented Dec 13, 2024

Hi, thanks for the investigation. Is the builder also supported in 2.x? If so, we can just switch to it in all cases, with no need to expose extra options to the user.

@clairemcginty
Contributor Author

> Hi, thanks for the investigation. Is the builder also supported in 2.x? If so, we can just switch to it in all cases, with no need to expose extra options to the user.

Unfortunately it isn't :/ There is no way to construct a GoogleCloudStorageImpl that works for both gcs-connector 2.x and 3.x (at least as far as I've been able to find).

Contributor

@Abacn Abacn left a comment


understand, thanks!

@clairemcginty
Contributor Author

> understand, thanks!

thanks @Abacn ! should I update CHANGES.md for the 2.62.0 release or the next one?

@Abacn
Contributor

Abacn commented Dec 23, 2024

The Java precommit timed out on several reruns. It's passing on HEAD; could you please take a look at whether the timeout is related to this change?

@Abacn
Contributor

Abacn commented Dec 27, 2024

After multiple reruns, there is indeed a related test failure:

https://github.com/apache/beam/runs/34888346578

testGcpCoreApiSurface (org.apache.beam.sdk.extensions.gcp.GcpCoreApiSurfaceTest) failed

java.lang.AssertionError: 
Expected: API surface to include only:

...

 but: The following disallowed classes appeared on the API surface:
	class com.google.protobuf.AbstractMessage exposed via:
...
interface org.apache.beam.sdk.extensions.gcp.options.GcsOptions$GoogleCloudStorageProvider

It looks like the test is written to prevent the exposure of unwanted classes. Refactoring in a way that does not leak them may fix it.
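
To make the failure mode concrete, here is a rough sketch of how a class can end up "exposed via" an interface: follow the return and parameter types of public methods transitively. This is a simplification and not Beam's actual API surface check. All type names (`LeakedMessage`, `ThirdPartyClient`, `ClientProvider`) are hypothetical stand-ins, and skipping `java.*` types is an assumption made to keep the walk short.

```java
import java.lang.reflect.Method;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-ins for illustration only.
class LeakedMessage {} // plays the role of a disallowed class

interface ThirdPartyClient { // plays the role of a third-party client interface
  LeakedMessage lastResponse();
}

interface ClientProvider { // plays the role of a provider exposed on a public options interface
  ThirdPartyClient get(String options);
}

class ApiSurfaceSketch {
  // Collects every non-JDK type reachable through public method signatures of `root`.
  static Set<Class<?>> exposedTypes(Class<?> root) {
    Set<Class<?>> seen = new LinkedHashSet<>();
    Deque<Class<?>> todo = new ArrayDeque<>();
    todo.add(root);
    while (!todo.isEmpty()) {
      Class<?> type = todo.poll();
      while (type.isArray()) {
        type = type.getComponentType(); // look through array types
      }
      if (type.isPrimitive() || type.getName().startsWith("java.") || !seen.add(type)) {
        continue; // skip primitives (including void), JDK types, and already-visited types
      }
      for (Method method : type.getMethods()) {
        todo.add(method.getReturnType());
        for (Class<?> parameter : method.getParameterTypes()) {
          todo.add(parameter);
        }
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    for (Class<?> type : exposedTypes(ClientProvider.class)) {
      System.out.println(type.getSimpleName());
    }
  }
}
```

In this sketch, `LeakedMessage` is reachable from `ClientProvider` only through `ThirdPartyClient`, which mirrors the chain reported in the test output.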

@clairemcginty
Contributor Author

clairemcginty commented Dec 27, 2024

> It looks like the test is written to prevent the exposure of unwanted classes. Refactoring in a way that does not leak them may fix it.

thanks for looking into it @Abacn ! hmm, this seems challenging to refactor since the exposure is coming from interface com.google.cloud.hadoop.gcsio.GoogleCloudStorage itself... and there's no superclass/subclass that would be appropriate to substitute in here. I guess instead of adding a provider for a GoogleCloudStorage instance, we could add a provider for a BiFunction<StorageResourceId, CreateObjectOptions, WritableByteChannel> ? Although that would leak more implementation details to the user.
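
A rough sketch of what that alternative could look like, purely to show the shape of the BiFunction: `FakeResourceId` and `FakeCreateOptions` are hypothetical stand-ins for StorageResourceId and CreateObjectOptions, and the channel writes to an in-memory buffer rather than to GCS.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.function.BiFunction;

// Hypothetical stand-ins; NOT the real gcs-connector types.
final class FakeResourceId {
  final String bucket;
  final String object;

  FakeResourceId(String bucket, String object) {
    this.bucket = bucket;
    this.object = object;
  }
}

final class FakeCreateOptions {
  final String contentType;

  FakeCreateOptions(String contentType) {
    this.contentType = contentType;
  }
}

class ChannelProviderSketch {
  // Writes `payload` through whatever channel the supplied function opens.
  static void write(
      BiFunction<FakeResourceId, FakeCreateOptions, WritableByteChannel> openChannel,
      FakeResourceId id,
      FakeCreateOptions options,
      String payload) {
    try (WritableByteChannel channel = openChannel.apply(id, options)) {
      ByteBuffer buffer = ByteBuffer.wrap(payload.getBytes(StandardCharsets.UTF_8));
      while (buffer.hasRemaining()) {
        channel.write(buffer);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Demo: the "provider" here opens a channel over an in-memory buffer.
  static byte[] demo() {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    write(
        (id, options) -> Channels.newChannel(sink),
        new FakeResourceId("bucket", "object"),
        new FakeCreateOptions("text/plain"),
        "hello");
    return sink.toByteArray();
  }

  public static void main(String[] args) {
    System.out.println(new String(demo(), StandardCharsets.UTF_8)); // prints "hello"
  }
}
```

The trade-off mentioned above is visible here: the caller must know how a write channel is opened, which is an implementation detail the instance-level provider kept hidden.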
