
[ADAP-948] [Feature] Dataproc submit jobs :: Support request properties other than mainJarFileUri #970

Closed
apalmeida1990 opened this issue Oct 17, 2023 · 2 comments
Labels: enhancement (New feature or request), triage, Stale

@apalmeida1990

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

My team will start building some dbt Python models; since we use GCP at our company, we decided to opt for Dataproc. However, after starting a Dataproc cluster, we faced a specific issue.

We integrated Google's Workload Identity Federation with our Dataproc cluster so we can have better auth management. The problem is that we need to include some extra properties when submitting the job. According to the Google Dataproc API documentation, this is possible by including the properties field in the submit request.

We noticed that, currently, you only support the field mainJarFileUri:

    def _submit_dataproc_job(self) -> dataproc_v1.types.jobs.Job:
        job = {
            "placement": {"cluster_name": self._get_cluster_name()},
            "pyspark_job": {
                "main_python_file_uri": self.gcs_location,
            },
        }
        operation = self.job_client.submit_job_as_operation(  # type: ignore
            request={
                "project_id": self.credential.execution_project,
                "region": self.credential.dataproc_region,
                "job": job,
            }
        )
        # check if job failed
        response = operation.result(polling=self.result_polling_policy)
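
For reference, here is a minimal sketch of the same payload with the extra properties field, which the Dataproc v1 API already accepts on a PySpark job (the cluster name, file path, and property values below are illustrative, not taken from this repo):

    # Illustrative only: a PySpark job payload carrying the extra
    # "properties" field supported by the Dataproc v1 API.
    job = {
        "placement": {"cluster_name": "my-cluster"},  # hypothetical cluster
        "pyspark_job": {
            "main_python_file_uri": "gs://my-bucket/models/my_model.py",  # hypothetical path
            "properties": {
                "spark.kubernetes.authenticate.executor.serviceAccountName": "spark-executor-x-team",
                "spark.kubernetes.authenticate.driver.serviceAccountName": "spark-driver-x-team",
            },
        },
    }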

Describe alternatives you've considered

We forked this repo and tried a few things. We were able to make it work; it was a simple and direct solution, but without much consideration for tests, maintainability, and so on.

However, we don't like this approach, as it makes eventual upgrades of the dbt-core and dbt-bigquery packages more complicated.

Still, below you can see a very simple implementation that worked.

In https://github.com/dbt-labs/dbt-bigquery/blob/main/dbt/adapters/bigquery/connections.py:

    dataproc_region: Optional[str] = None
    dataproc_cluster_name: Optional[str] = None
    gcs_bucket: Optional[str] = None
++  dataproc_properties: Optional[Dict[str, str]] = None

    dataproc_batch: Optional[DataprocBatchConfig] = field(
        metadata={
            "serialization_strategy": pass_through,
        },
        default=None,
    )

Then, in https://github.com/dbt-labs/dbt-bigquery/blob/main/dbt/adapters/bigquery/python_submissions.py:

    def _submit_dataproc_job(self) -> dataproc_v1.types.jobs.Job:
        job = {
            "placement": {"cluster_name": self._get_cluster_name()},
            "pyspark_job": {
                "main_python_file_uri": self.gcs_location,
            },
        }

++      if self.credential.dataproc_properties is not None:
++          job["pyspark_job"]["properties"] = self.credential.dataproc_properties

        operation = self.job_client.submit_job_as_operation(  # type: ignore
            request={
                "project_id": self.credential.execution_project,
                "region": self.credential.dataproc_region,
                "job": job,
            }
        )

With this, we were able to define the new properties in our profiles.yml:

      # for dbt Python models to be run on Dataproc cluster
      submission_method: cluster
      dataproc_cluster_name: x
      dataproc_region: x
      gcs_bucket: x
      dataproc_properties:
        spark.kubernetes.authenticate.executor.serviceAccountName: spark-executor-x-team
        spark.kubernetes.authenticate.driver.serviceAccountName: spark-driver-x-team
        spark.kubernetes.container.image: gcr.io/...

This solution is debatable because it only fixed my specific issue. It could be extended to support the rest of the properties supported by the Dataproc SparkJob endpoint, maybe by creating a struct that allows defining multiple parameters in the profiles, like the following (a rough sketch of such a struct comes after this example):

      # for dbt Python models to be run on Dataproc cluster
      submission_method: cluster
      dataproc_cluster_name: x
      dataproc_region: x
      gcs_bucket: x
      dataproc_spark_job:
        properties:
          spark.kubernetes.authenticate.executor.serviceAccountName: spark-executor-x-team
          spark.kubernetes.authenticate.driver.serviceAccountName: spark-driver-x-team
          spark.kubernetes.container.image: gcr.io/...
        loggingConfig:
          x
          y
        mainClass: xpto
        ...
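
A rough sketch of what such a struct could look like, mirroring the pass-through pattern already used by DataprocBatchConfig above (all names here are hypothetical, not an existing dbt-bigquery API):

    # Hypothetical sketch: a pass-through config object holding arbitrary
    # SparkJob/PySparkJob fields that get merged into the job dict at
    # submit time.
    from dataclasses import dataclass, field
    from typing import Any, Dict

    @dataclass
    class DataprocSparkJobConfig:
        # Arbitrary job fields (properties, logging_config, main_class, ...)
        # passed through verbatim to the Dataproc API.
        spark_job: Dict[str, Any] = field(default_factory=dict)

    # In _submit_dataproc_job, the extra fields would then be merged in:
    #     if self.credential.dataproc_spark_job is not None:
    #         job["pyspark_job"].update(self.credential.dataproc_spark_job.spark_job)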

Who will this benefit?

I think everyone will benefit from having more customization of Dataproc jobs.

Are you interested in contributing this feature?

I might be interested in doing that, but I'll need some guidance because I'm not 100% fluent in this project, e.g. what I should ensure is done before opening a PR (tests, for example).

Anything else?

n/a

apalmeida1990 added the enhancement and triage labels on Oct 17, 2023
github-actions bot changed the title to add the [ADAP-948] prefix on Oct 17, 2023
github-actions bot commented:

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions bot added the Stale label on Apr 15, 2024
github-actions bot commented:

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions bot closed this as not planned on Apr 23, 2024