
[ADAP-948] [Feature] Dataproc submit jobs :: Support request properties other than mainJarFileUri #970

Closed
apalmeida1990 opened this issue Oct 17, 2023 · 2 comments
Labels: enhancement (New feature or request), triage, Stale

@apalmeida1990

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

My team will start building some dbt Python models; since we use GCP at our company, we decided to opt for Dataproc. However, after starting a Dataproc cluster, we faced a specific issue.

We integrated Google's Workload Identity Federation with our Dataproc cluster so we can have better auth management. The problem is that we need to include some extra properties when submitting the job. According to the Google Dataproc API documentation, this is possible by including the properties field in the submit request.

We noticed that, currently, you only support the field mainJarFileUri:

    def _submit_dataproc_job(self) -> dataproc_v1.types.jobs.Job:
        job = {
            "placement": {"cluster_name": self._get_cluster_name()},
            "pyspark_job": {
                "main_python_file_uri": self.gcs_location,
            },
        }
        operation = self.job_client.submit_job_as_operation(  # type: ignore
            request={
                "project_id": self.credential.execution_project,
                "region": self.credential.dataproc_region,
                "job": job,
            }
        )
        # check if job failed
        response = operation.result(polling=self.result_polling_policy)
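
For reference, here is a minimal sketch of the same payload with the extra properties field, which the Dataproc v1 API already accepts on a PySpark job (the cluster name, file path, and property values below are illustrative, not taken from this repo):

    # Illustrative only: a PySpark job payload carrying the extra
    # "properties" field supported by the Dataproc v1 API.
    job = {
        "placement": {"cluster_name": "my-cluster"},  # hypothetical cluster
        "pyspark_job": {
            "main_python_file_uri": "gs://my-bucket/models/my_model.py",  # hypothetical path
            "properties": {
                "spark.kubernetes.authenticate.executor.serviceAccountName": "spark-executor-x-team",
                "spark.kubernetes.authenticate.driver.serviceAccountName": "spark-driver-x-team",
            },
        },
    }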

Describe alternatives you've considered

We forked this repo and tried a few things. We were able to make it work; it was a simple and direct solution, but without much consideration for tests, maintainability, and so on.

However, we don't like this approach, as it makes eventual upgrades of the dbt-core and dbt-bigquery packages more complicated.

Still, below you can see a very simple implementation that worked.

In https://github.com/dbt-labs/dbt-bigquery/blob/main/dbt/adapters/bigquery/connections.py:

    dataproc_region: Optional[str] = None
    dataproc_cluster_name: Optional[str] = None
    gcs_bucket: Optional[str] = None
++  dataproc_properties: Optional[Dict[str, str]] = None

    dataproc_batch: Optional[DataprocBatchConfig] = field(
        metadata={
            "serialization_strategy": pass_through,
        },
        default=None,
    )

Then, in https://github.com/dbt-labs/dbt-bigquery/blob/main/dbt/adapters/bigquery/python_submissions.py:

    def _submit_dataproc_job(self) -> dataproc_v1.types.jobs.Job:
        job = {
            "placement": {"cluster_name": self._get_cluster_name()},
            "pyspark_job": {
                "main_python_file_uri": self.gcs_location,
            },
        }

++      if self.credential.dataproc_properties is not None:
++          job["pyspark_job"]["properties"] = self.credential.dataproc_properties

        operation = self.job_client.submit_job_as_operation(  # type: ignore
            request={
                "project_id": self.credential.execution_project,
                "region": self.credential.dataproc_region,
                "job": job,
            }
        )

With this, we were able to define the new properties in our profiles.yml:

      # for dbt Python models to be run on Dataproc cluster
      submission_method: cluster
      dataproc_cluster_name: x
      dataproc_region: x
      gcs_bucket: x
      dataproc_properties:
        spark.kubernetes.authenticate.executor.serviceAccountName: spark-executor-x-team
        spark.kubernetes.authenticate.driver.serviceAccountName: spark-driver-x-team
        spark.kubernetes.container.image: gcr.io/...

This solution is debatable because it only fixed my specific issue. It could be extended to support the rest of the properties supported by the Dataproc SparkJob endpoint, maybe by creating a struct that allows defining multiple parameters in the profiles, like the following (a rough sketch of such a struct comes after this example):

      # for dbt Python models to be run on Dataproc cluster
      submission_method: cluster
      dataproc_cluster_name: x
      dataproc_region: x
      gcs_bucket: x
      dataproc_spark_job:
        properties:
          spark.kubernetes.authenticate.executor.serviceAccountName: spark-executor-x-team
          spark.kubernetes.authenticate.driver.serviceAccountName: spark-driver-x-team
          spark.kubernetes.container.image: gcr.io/...
        loggingConfig:
          x
          y
        mainClass: xpto
        ...
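
A rough sketch of what such a struct could look like, mirroring the pass-through pattern already used by DataprocBatchConfig above (all names here are hypothetical, not an existing dbt-bigquery API):

    # Hypothetical sketch: a pass-through config object holding arbitrary
    # SparkJob/PySparkJob fields that get merged into the job dict at
    # submit time.
    from dataclasses import dataclass, field
    from typing import Any, Dict

    @dataclass
    class DataprocSparkJobConfig:
        # Arbitrary job fields (properties, logging_config, main_class, ...)
        # passed through verbatim to the Dataproc API.
        spark_job: Dict[str, Any] = field(default_factory=dict)

    # In _submit_dataproc_job, the extra fields would then be merged in:
    #     if self.credential.dataproc_spark_job is not None:
    #         job["pyspark_job"].update(self.credential.dataproc_spark_job.spark_job)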

Who will this benefit?

I think everyone will benefit from having more customization of Dataproc jobs.

Are you interested in contributing this feature?

I might be interested in doing that, but I'll need some guidance because I'm not 100% fluent in this project, e.g. what I should ensure is done before opening a PR (tests, for example).

Anything else?

n/a

apalmeida1990 added the enhancement and triage labels on Oct 17, 2023
github-actions bot changed the title to add the [ADAP-948] prefix on Oct 17, 2023
github-actions bot commented:

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions bot added the Stale label on Apr 15, 2024
github-actions bot commented:

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions bot closed this as not planned on Apr 23, 2024