Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load_as_spark using config as variable instead of file #483

Open
stevenayers-bge opened this issue Apr 24, 2024 · 12 comments · May be fixed by #573
Open

load_as_spark using config as variable instead of file #483

stevenayers-bge opened this issue Apr 24, 2024 · 12 comments · May be fixed by #573

Comments

@stevenayers-bge
Copy link

Hi all,

I want to use load_as_spark, but instead of saving the config credentials as a file, I want to pull them from a secrets manager (Azure Vault, AWS SecretsManager, Databricks Secret) instead of storing the file on the filesystem. I want to avoid saving the json config to the filesystem for security reasons.

For example:

delta_sharing_config = dbutils.secrets.get(scope="delta-sharing", key="my-config")

df = delta_sharing.load_as_spark(delta_sharing_config + '#<share name>.<schema>.<table name>')

Is this possible?

Thanks

@aimtsou
Copy link

aimtsou commented Apr 25, 2024

@stevenayers-bge: Would you be satisfied with building the json object while bringing the secrets from key vault and any other configuration from application resource configuration or even from the keyvault?

@stevenayers
Copy link

@aimtsou initializing the rest client works fine, for example:

import delta_sharing
from delta_sharing.protocol import DeltaSharingProfile

# GET SECRET VALUE
delta_sharing_config = dbutils.secrets.get(scope="delta-sharing", key="my-config")

# PASS SECRET CONFIG WITHOUT SAVING TO FILE
profile = DeltaSharingProfile.from_json(delta_sharing_config)
client = delta_sharing.SharingClient(profile)

# List tables available in share
client.list_all_tables()

The issue is when you go to read the data into a dataframe. You cannot use SharingClient or DataSharingProfile to authenticate when reading the dataframe. You have to pass the file in the load_as_spark function URL argument:

import delta_sharing

share_file = "/tmp/my-secret-config-stored-in-a-file" # security problem!

df = delta_sharing.load_as_spark(share_file + '#nep_test.reference.settlement_period_calendar')

@stevenayers
Copy link

@aimtsou @linzhou-db this old PR would solve the issue #103

@aimtsou
Copy link

aimtsou commented Apr 25, 2024

@stevenayers,

Now i get your point. Yes the load_as_spark function does not support SharingClient, so you are proposing something different from me, to pass the whole JSON stored in secure storage as a profile to Spark, I proposed to build the JSON object in python which is closer to @zsxwing suggests. I see 2 different implementations to be honest.

My question is what is your incentive for storing the whole file is a secret store? Usually we retrieve only the secret value. I mean in your case the person who has access to this databricks workspace has already access to the whole file since it is in the secret store.

@stevenayers-bge
Copy link
Author

@aimtsou i don't mind how it's done 🙂 Im not sure your PR will help though. load_as_spark doesn't interact with the Python SharingClient.

load_as_spark passes the config file path to the JVM, and the file is read and parsed within the delta sharing scala library.

@stevenayers-bge
Copy link
Author

stevenayers-bge commented Apr 25, 2024

@stevenayers,

Now i get your point. Yes the load_as_spark function does not support SharingClient, so you are proposing something different from me, to pass the whole JSON stored in secure storage as a profile to Spark, I proposed to build the JSON object in python which is closer to @zsxwing suggests. I see 2 different implementations to be honest.

My question is what is your incentive for storing the whole file is a secret store? Usually we retrieve only the secret value. I mean in your case the person who has access to this databricks workspace has already access to the whole file since it is in the secret store.

I don't have an incentive to store the whole file in a secret manager, but it is often easier to store the whole json config in one secret.

It would be equally secure if you only stored the bearerToken in a secrets manager, but then you're storing some of the config on a filesystem, and some in a secrets manager, which is a bit messy.

All I'm proposing is that we can pass profile configuration via parameters, rather than passing in a filepath. So something like:

share_endpoint = ... pulled in from a secretsmanager or whatever
share_token = ... pulled in from a secretsmanager or whatever

delta_sharing.load_as_spark(
    share_name='<share-name>',
    schema_name='<schema name>',
    table_name='my_table',
    endpoint=share_endpoint,
    bearer_token=share_token
)

# or pass a client object?
client: SharingClient = ...
delta_sharing.load_as_spark(client=client, share_name=.... etc)

@aimtsou does that make sense?

@aimtsou
Copy link

aimtsou commented Apr 25, 2024

For me yes and as you said yes there is only a df.load(url) in load_as_spark so indeed it needs to be done on the scala side.

@shcent
Copy link

shcent commented Jun 28, 2024

This is causing me a whole world of pain of Databricks. Having to write the config into a temporary file that sometimes needs a /dbfs prefix and sometimes doesn't. Completely agree that a solution that does not rely on a file base url would be much better.

@stevenayers-bge
Copy link
Author

@linzhou-db @moderakh @zhu-tom @pranavsuku-db do you have any time to look at this? I can also raise a PR with the changes if that helps?

I know it's available in the rust client, and I've seen you're using maturin/pyo3 so it might already be possible, I'm just not very familiar with how rust & python interact.

Please let me know if there's anything I can do to help 🙏

@linzhou-db
Copy link
Collaborator

@stevenayers-bge sure feel free to send out the PR.

@stevenayers stevenayers linked a pull request Sep 14, 2024 that will close this issue
8 tasks
@nuno-s
Copy link

nuno-s commented Oct 31, 2024

This will be a awesome change!

In my opinion this should be possible to do with other readers to, like delta_sharing.load_as_pandas()

@MihalisW
Copy link

Bump on this please - as a Delta Share provider some of our recipients might object to having to persist credentials to a file, even temporarily.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants