
[#5624] feat(bundles): support ADLS credential provider #5737

Merged
11 commits merged into apache:main on Dec 17, 2024

Conversation

orenccl
Collaborator

@orenccl orenccl commented Dec 3, 2024

What changes were proposed in this pull request?

Add ADLS credential provider

Why are the changes needed?

Fix: #5624

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a unit test and verified successful access to the ADLS container.

Supplementary Information

I chose to use the following versions of the Azure libraries:

  • azure-identity = "1.13.1"
  • azure-storage-file-datalake = "12.20.0"
  • azure-core-http-okhttp = "1.12.0"

These are not the latest versions because, although the official documentation states support for Java 8 and later, the latest releases appear to have been compiled with Java 21, which caused compilation issues in the Gravitino environment. I therefore downgraded and tested to find the latest usable versions.
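
For context, here is a minimal sketch (not the actual build script from this PR) of how these pinned versions could be declared in a Gradle Kotlin DSL dependencies block; the coordinates are the standard Azure Maven coordinates, and the real Gravitino build may route them through its version catalog instead:

// Sketch only: pin the Azure libraries to the versions verified to compile on Java 8.
dependencies {
    implementation("com.azure:azure-identity:1.13.1")
    implementation("com.azure:azure-storage-file-datalake:12.20.0")
    implementation("com.azure:azure-core-http-okhttp:1.12.0")
}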

@orenccl orenccl marked this pull request as draft December 3, 2024 08:33
@orenccl orenccl self-assigned this Dec 3, 2024
@orenccl orenccl requested a review from FANNG1 December 3, 2024 08:33
@orenccl
Collaborator Author

orenccl commented Dec 3, 2024

Hi @FANNG1, could you please take a look when you have a moment?

@FANNG1
Contributor

FANNG1 commented Dec 3, 2024

It seems there are class conflicts between the Netty in Gravitino and the Netty pulled in by the Azure SDK. My suggestion is to shade the related packages in azure-bundle, like what is done in gcp-bundle:

import com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar // task class provided by the Gradle Shadow plugin

tasks.withType(ShadowJar::class.java) {
  isZip64 = true
  configurations = listOf(project.configurations.runtimeClasspath.get())
  archiveClassifier.set("")

  // Relocate dependencies to avoid conflicts
  relocate("org.apache.httpcomponents", "org.apache.gravitino.gcp.shaded.org.apache.httpcomponents")
  relocate("org.apache.commons", "org.apache.gravitino.gcp.shaded.org.apache.commons")
  relocate("com.google", "org.apache.gravitino.gcp.shaded.com.google")
  relocate("com.fasterxml", "org.apache.gravitino.gcp.shaded.com.fasterxml")
}
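
Applied to azure-bundle, the analogous configuration might look like the sketch below; the relocation prefixes and the chosen packages (notably io.netty, the conflicting one) are illustrative assumptions, not necessarily what this PR ends up shipping:

import com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar

tasks.withType(ShadowJar::class.java) {
  isZip64 = true
  configurations = listOf(project.configurations.runtimeClasspath.get())
  archiveClassifier.set("")

  // Relocate the Azure SDK's transitive dependencies so they cannot clash with
  // the Netty/Jackson/Reactor classes already on the Gravitino classpath.
  relocate("io.netty", "org.apache.gravitino.azure.shaded.io.netty")
  relocate("com.fasterxml", "org.apache.gravitino.azure.shaded.com.fasterxml")
  relocate("reactor", "org.apache.gravitino.azure.shaded.reactor")
}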

@FANNG1
Contributor

FANNG1 commented Dec 3, 2024

What is the purpose of IcebergConstants?

IcebergConstants defines the properties for the Iceberg catalog and IcebergRESTServer, which are used in the Gravitino server, the Gravitino IcebergRESTServer, the Spark connector, etc.

What variables should be included in IcebergCatalogWrapper?

Besides the credential information defined in xxTokenCredential, we'd better pass the other credential-related information, such as the endpoint and region, to the client too, so the client side doesn't need to do extra configuration. For ADLS, I don't know whether it needs any extra configuration (I guess not); maybe you could test whether it works without extra config.

Contributor

@tengqm tengqm left a comment


overall lgtm.
a few nits for consideration.

@orenccl orenccl closed this Dec 4, 2024
@orenccl orenccl reopened this Dec 4, 2024
@orenccl orenccl changed the title [Help Wanted][#5624] feat(bundles): support ADSL credential provider [#5624] feat(bundles): support ADSL credential provider Dec 4, 2024
@orenccl orenccl changed the title [#5624] feat(bundles): support ADSL credential provider [#5624] feat(bundles): support ADLS credential provider Dec 4, 2024
@orenccl orenccl marked this pull request as ready for review December 4, 2024 05:59
@orenccl
Collaborator Author

orenccl commented Dec 4, 2024

Hi @FANNG1 ,
Thank you for your help earlier, which allowed me to make progress.
This PR is ready for review now; please take a look. 🙇‍♂️
I will update the documentation within a few days in this PR.

@FANNG1
Contributor

FANNG1 commented Dec 4, 2024

@orenccl I don't have enough time to test and review this week; it may be delayed until next week. Is that OK for you?

@orenccl
Collaborator Author

orenccl commented Dec 4, 2024

Hi, @FANNG1
Sure, totally fine. Thanks for letting me know!

@orenccl orenccl marked this pull request as draft December 5, 2024 07:03
@orenccl
Collaborator Author

orenccl commented Dec 6, 2024

Originally, I was able to pass the unit tests. However, when I built the Iceberg REST server Docker image and tested it with Spark, I ran into an error.

INSERT INTO testDatabase.test_azure VALUES (1), (2)
24/12/06 22:49:49 ERROR TorrentBroadcast: Store broadcast broadcast_0 fail, remove all pieces of the broadcast
24/12/06 22:49:49 ERROR SparkSQLDriver: Failed in [INSERT INTO testDatabase.test_azure VALUES (1), (2)]
java.io.NotSerializableException: com.azure.storage.common.StorageSharedKeyCredential
Serialization stack:
        - object not serializable (class: com.azure.storage.common.StorageSharedKeyCredential, value: com.azure.storage.common.StorageSharedKeyCredential@63f38fb1)

I found that this is a known issue. Please refer to:

This issue has been addressed in Iceberg version 1.6.0.

How should we handle this problem?

@orenccl orenccl marked this pull request as ready for review December 6, 2024 17:19
@orenccl
Collaborator Author

orenccl commented Dec 7, 2024

Using the account name and account key to obtain StorageSharedKeyCredential doesn't work in the Iceberg 1.5.2 + Spark environment; only SAS tokens or DefaultAzureCredentialBuilder can be used.

However, the Gravitino server currently cannot use tokens to write metadata information, and DefaultAzureCredentialBuilder requires users to set environment variables, which cannot be configured through Gravitino conf.
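
For reference, a rough Kotlin sketch of the two client-side options mentioned above, using azure-storage-file-datalake and azure-identity (the endpoint and placeholder values are hypothetical, not taken from this PR):

import com.azure.identity.DefaultAzureCredentialBuilder
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder

// Hypothetical storage account endpoint, for illustration only.
val endpoint = "https://myaccount.dfs.core.windows.net"

// Option 1: authenticate the client with a pre-issued SAS token.
val sasClient = DataLakeServiceClientBuilder()
    .endpoint(endpoint)
    .sasToken("<sas-token>")
    .buildClient()

// Option 2: DefaultAzureCredential resolves credentials from the environment
// (environment variables, managed identity, Azure CLI, ...), which is why it
// cannot be driven purely through Gravitino conf.
val envClient = DataLakeServiceClientBuilder()
    .endpoint(endpoint)
    .credential(DefaultAzureCredentialBuilder().build())
    .buildClient()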


Additional Testing Observations

After encountering this issue, I retested it on another PC that does not have the AWS CLI installed. The unit test failed during the Insert operation.

Previously, the tests might have passed due to some interference from the AWS CLI. However, under normal conditions, the tests should have failed.

For reference, please see: AWS Java SDK Developer Guide - Credentials.

In simple terms: it seems the AWS SDK automatically picks up the temporary credentials stored by the AWS CLI.

@FANNG1
Contributor

FANNG1 commented Dec 11, 2024

The main problem with the current implementation is that it passes the Azure secret key, not the token, to the client side (the token name is not correct). The StorageSharedKeyCredential problem exists on the client side, not on the Iceberg REST server side, which means we could update the Iceberg version when testing. Updating the Iceberg version on the server side is also OK, but I'd prefer to do that in another PR.

@FANNG1
Contributor

FANNG1 commented Dec 11, 2024

After correcting the token name for Iceberg, I still failed to pass the IT due to an auth failure; it seems the token is not generated correctly (I also failed to perform the Azure operation with the generated token in the Python client). Could you dig out the reason?

@orenccl orenccl closed this Dec 11, 2024
@orenccl
Collaborator Author

orenccl commented Dec 12, 2024

Updated a version that corrects the token key and passes the IT test.
However, this version's SAS token does not specify a path.
Currently, specifying a path in the SAS token causes it to fail. I am still working on resolving this issue.
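
For anyone following along, a hedged sketch of generating a path-scoped SAS with azure-storage-file-datalake; the account, container, directory, permissions, and expiry below are illustrative assumptions, not the values used in this PR. Scoping the signature to the right resource and permission set is usually where path-level SAS generation goes wrong:

import com.azure.storage.common.StorageSharedKeyCredential
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder
import com.azure.storage.file.datalake.sas.DataLakeServiceSasSignatureValues
import com.azure.storage.file.datalake.sas.PathSasPermission
import java.time.OffsetDateTime

// Hypothetical account, container, and directory, for illustration only.
val credential = StorageSharedKeyCredential("myaccount", "<account-key>")
val directoryClient = DataLakeServiceClientBuilder()
    .endpoint("https://myaccount.dfs.core.windows.net")
    .credential(credential)
    .buildClient()
    .getFileSystemClient("mycontainer")
    .getDirectoryClient("warehouse/testDatabase")

// Grant read/write/list only on that directory, valid for one hour.
val permissions = PathSasPermission()
    .setReadPermission(true)
    .setWritePermission(true)
    .setListPermission(true)

val sasToken = directoryClient.generateSas(
    DataLakeServiceSasSignatureValues(OffsetDateTime.now().plusHours(1), permissions))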

@orenccl orenccl reopened this Dec 12, 2024
@orenccl orenccl marked this pull request as draft December 16, 2024 10:44
@orenccl orenccl marked this pull request as ready for review December 16, 2024 15:03
@FANNG1
Contributor

FANNG1 commented Dec 17, 2024

LGTM, just one comment. @yuqi1129, do you have time to review? Some changes are related to the fileset Azure properties.

@FANNG1 FANNG1 merged commit 73d5ffc into apache:main Dec 17, 2024
26 checks passed
@FANNG1
Contributor

FANNG1 commented Dec 17, 2024

@orenccl, merged to main. It's an important feature for Gravitino, thanks for your work! There are some related follow-up issues, such as supporting AzureSecretKeyCredential and supporting ADLS credentials in the Python client. Do you want to continue working on them?

@orenccl
Collaborator Author

orenccl commented Dec 17, 2024

@FANNG1
Thank you for providing help and reviewing!
Yes, I would like to continue working on it. I can open an issue later today, or if you create it first, please tag me and assign it to me!

@FANNG1
Contributor

FANNG1 commented Dec 17, 2024

@FANNG1 Thank you for providing help and reviewing! Yes, I would like to continue working on it. I can open an issue later today, or if you create it first, please tag me and assign it to me!

thanks, you could create the issue when you are free

@orenccl
Collaborator Author

orenccl commented Dec 17, 2024

I think when we implement AzureSecretKeyCredential, we need to upgrade the Iceberg version.
We should upgrade to at least version 1.6.0, as it supports AzureSecretKeyCredential.

For supporting ADLS credentials in the Python client, should we refer to PR #5209 ?

@FANNG1
Contributor

FANNG1 commented Dec 17, 2024

I think when we implement AzureSecretKeyCredential, we need to upgrade the Iceberg version. We should upgrade to at least version 1.6.0, as it supports AzureSecretKeyCredential.

For supporting ADLS credentials in the Python client, should we refer to PR #5209 ?

Updating the Iceberg version to 1.6.0 doesn't seem required, because the serialization problem exists on the Iceberg REST client side (Spark or Flink), not the Iceberg REST server side, but either way (updating Iceberg or not) is OK to me.

For ADLS credentials in Python client, please refer to #5890

@orenccl orenccl deleted the feat/adlsToken branch December 22, 2024 09:24
jerryshao pushed a commit that referenced this pull request Dec 24, 2024
…nnector (#5952)

### What changes were proposed in this pull request?

1. Most of the code work is implemented in #5938 #5737, including the catalog
properties conversion and adding the Iceberg Azure bundle jar; this PR is mainly
about tests and documentation.
2. Remove the hidden properties of the cloud secret key from the Iceberg
catalog, as Gravitino doesn't have unified security management yet
and the Iceberg REST server needs to fetch the catalog cloud properties to
initiate `IcebergWrapper` dynamically. Another benefit is that the Spark connector
does not need to specify the secret key explicitly.

Supports ADLS for the Iceberg catalog and Spark connector

### Why are the changes needed?
Fix: #5954 

### Does this PR introduce _any_ user-facing change?
Yes, the user no longer needs to specify the cloud secret key in the Spark
connector.

### How was this patch tested?
Tested in a local environment.
Abyss-lord pushed a commit to Abyss-lord/gravitino that referenced this pull request Dec 29, 2024