Is this your first time submitting a feature request?
- [x] I have searched the existing issues, and I could not find an existing issue for this feature
- [x] I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion
Describe the feature
My team will soon start building some dbt Python models. Since we use GCP at our company, we decided to opt for Dataproc. However, after starting a Dataproc cluster we ran into a specific issue.
We integrated Google's Workload Identity Federation with our Dataproc cluster so we can have better auth management. The problem is that we need to include some extra properties when submitting the job. According to the Google Dataproc API documentation, this is possible by including the properties field in the submit request.
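(For context, this is roughly what such a submit request looks like with the google-cloud-dataproc Python client; the project, region, cluster, and property values below are placeholders, not code from this repo.)

```python
from google.cloud import dataproc_v1

# Placeholder values for illustration only.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
job = {
    "placement": {"cluster_name": "x"},
    "spark_job": {
        "main_jar_file_uri": "gs://x/jars/app.jar",
        # The `properties` field this issue is about:
        "properties": {
            "spark.kubernetes.authenticate.executor.serviceAccountName": "spark-executor-x-team",
            "spark.kubernetes.authenticate.driver.serviceAccountName": "spark-driver-x-team",
        },
    },
}
client.submit_job(request={"project_id": "x", "region": "us-central1", "job": job})
```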
We noticed that, currently, you only support the field mainJarFileUri (see dbt-bigquery/dbt/adapters/bigquery/python_submissions.py, lines 97 to 112 at bfc5dc4).
Describe alternatives you've considered
We forked this repo and tried a few things. We were able to make it work with a simple and direct solution, though without much consideration for tests, maintainability, and so on. We don't like this approach, however, because it makes eventual upgrades of the dbt-core and dbt-bigquery packages more complicated. Still, below you can see the super simple implementation that worked.
In https://github.com/dbt-labs/dbt-bigquery/blob/main/dbt/adapters/bigquery/connections.py, we added a dataproc_properties field to the credentials (the ++ line):

```python
dataproc_region: Optional[str] = None
dataproc_cluster_name: Optional[str] = None
gcs_bucket: Optional[str] = None
++  dataproc_properties: Optional[Dict[str, str]] = None
dataproc_batch: Optional[DataprocBatchConfig] = field(
    metadata={
        "serialization_strategy": pass_through,
    },
    default=None,
)
```

Then, in https://github.com/dbt-labs/dbt-bigquery/blob/main/dbt/adapters/bigquery/python_submissions.py, we forwarded that field into the submitted job.
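Roughly, the idea there looks like the following sketch. This is a hedged approximation, not the exact diff: attribute names such as self.credential, self._gcs_path, and self.job_client stand in for whatever the cluster submission helper actually uses.

```python
# Sketch: inside the Dataproc cluster submission helper in python_submissions.py.
def _submit_dataproc_job(self):
    job = {
        "placement": {"cluster_name": self.credential.dataproc_cluster_name},
        "pyspark_job": {"main_python_file_uri": self._gcs_path},
    }
    # Forward user-supplied Spark properties (e.g. the spark.kubernetes.*
    # settings needed for Workload Identity Federation) into the job request.
    if self.credential.dataproc_properties:
        job["pyspark_job"]["properties"] = dict(self.credential.dataproc_properties)
    operation = self.job_client.submit_job_as_operation(
        request={
            "project_id": self.credential.execution_project,
            "region": self.credential.dataproc_region,
            "job": job,
        }
    )
    return operation.result()
```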
With this, we were able to define the new properties in our dbt_profiles.yaml:

```yaml
# for dbt Python models to be run on Dataproc cluster
submission_method: cluster
dataproc_cluster_name: x
dataproc_region: x
gcs_bucket: x
dataproc_properties:
  spark.kubernetes.authenticate.executor.serviceAccountName: spark-executor-x-team
  spark.kubernetes.authenticate.driver.serviceAccountName: spark-driver-x-team
  spark.kubernetes.container.image: gcr.io/...
```
This solution is somewhat debatable because it fixed only my specific issue. It could be extended to support the rest of the properties supported by the Dataproc SparkJob endpoint, maybe by creating a struct that allows defining multiple parameters in the profiles, like:
```yaml
# for dbt Python models to be run on Dataproc cluster
submission_method: cluster
dataproc_cluster_name: x
dataproc_region: x
gcs_bucket: x
dataproc_spark_job:
  properties:
    spark.kubernetes.authenticate.executor.serviceAccountName: spark-executor-x-team
    spark.kubernetes.authenticate.driver.serviceAccountName: spark-driver-x-team
    spark.kubernetes.container.image: gcr.io/...
  loggingConfig: xy
  mainClass: xpto
  # ...
```
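For example, following the pattern of the existing dataproc_batch field in connections.py, such a struct could look like this sketch (DataprocSparkJobConfig and its field names are hypothetical; the real field would also want the same pass_through serialization metadata that dataproc_batch uses):

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical struct mirroring the existing DataprocBatchConfig pattern.
@dataclass
class DataprocSparkJobConfig:
    # Settings forwarded verbatim to the Dataproc SparkJob endpoint.
    properties: Optional[Dict[str, str]] = None
    logging_config: Optional[Dict[str, str]] = None
    main_class: Optional[str] = None

# Hypothetical view of where it would sit on the credentials dataclass,
# next to dataproc_region, gcs_bucket, dataproc_batch, etc.
@dataclass
class ProfileSketch:
    dataproc_cluster_name: Optional[str] = None
    dataproc_spark_job: Optional[DataprocSparkJobConfig] = None
```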
Who will this benefit?
I think everyone will benefit from having more customization available for Dataproc jobs.
Are you interested in contributing this feature?
I might be interested in doing that, but I'll need some guidance because I'm not 100% fluent in this project: things like what I should make sure is done (tests, for example) before opening a PR.
Anything else?
n/a
github-actions bot changed the title from "[Feature] Dataproc submit jobs :: Support other request properties other than mainJarFileUri" to "[ADAP-948] [Feature] Dataproc submit jobs :: Support other request properties other than mainJarFileUri" on Oct 17, 2023
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.