
[Regression] Profile no longer accepts staging_bucket param in execution_config #743

Closed
nickozilla opened this issue May 31, 2023 · 3 comments
Labels: bug (Something isn't working), regression

Comments


nickozilla commented May 31, 2023

Is this a regression in a recent version of dbt-bigquery?

  • I believe this is a regression in dbt-bigquery functionality
  • I have searched the existing issues, and I could not find an existing issue for this regression

Current Behavior

When running a Python model, I am unable to use the profile parameter staging_bucket, which lives under:

dataproc_batch:
  environment_config:
    execution_config:
      staging_bucket: "gs:..."

and dbt presents this error:

Unable to parse dataproc_batch as valid batch specification. See https://cloud.google.com/dataproc-serverless/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.Batch. Failed to parse environment_config field: Failed to parse execution_config field: Message type "google.cloud.dataproc.v1.ExecutionConfig" has no field named "staging_bucket" at "Batch.environment_config.execution_config".
 Available Fields(except extensions): "['serviceAccount', 'networkUri', 'subnetworkUri', 'networkTags', 'kmsKey']"..

As far as I can tell this is still the place where this parameter should live - https://github.com/googleapis/python-dataproc/blob/main/google/cloud/dataproc_v1/types/shared.py#L222
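To make the error message above easier to read: ParseDict rejects any key that is not a declared field on the target protobuf message, and the "Available Fields" list it prints matches an ExecutionConfig that predates staging_bucket. This is an illustrative stdlib-only sketch of that unknown-field check, not dbt-bigquery's or protobuf's actual code; the field list is copied from the error output.

```python
# Illustrative sketch only (plain Python, not the real protobuf code):
# ParseDict refuses any key that is not a declared field on the target
# message. The field list below is the "Available Fields" set printed in
# the error above, i.e. an ExecutionConfig without staging_bucket.
OLD_EXECUTION_CONFIG_FIELDS = {
    "service_account",
    "network_uri",
    "subnetwork_uri",
    "network_tags",
    "kms_key",
}

def parse_execution_config(cfg: dict, known=OLD_EXECUTION_CONFIG_FIELDS) -> dict:
    # Mimics ParseDict's unknown-field check for ExecutionConfig.
    for key in cfg:
        if key not in known:
            raise ValueError(
                f'Message type "google.cloud.dataproc.v1.ExecutionConfig" '
                f'has no field named "{key}"'
            )
    return cfg

try:
    parse_execution_config({"staging_bucket": "gs://my-bucket"})
except ValueError as e:
    print(e)
```

If the installed proto definitions did include staging_bucket, the same config would pass this check, which is why a stale google-cloud-dataproc/protobuf install is one plausible explanation.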

This was first noticed today after I built a new container with dbt-core==1.5.1, so I'm not sure whether the issue is due to that or to a backported fix in dbt-bigquery==1.5.1.

Expected/Previous Behavior

Previously, the same profile parsed the parameter and used the staging_bucket correctly. The default behaviour doesn't work for my use case, as the service account that runs dbt in Dataproc only has permissions on that specific bucket.

Steps To Reproduce

Dependencies:
dbt-core==1.5.1
dbt-bigquery==1.5.1

Profile:

modelling:
  target: dev
  outputs:
    dev:
      dataset: dev
      job_execution_timeout_seconds: 2000
      job_retries: 1
      location: EU
      method: oauth
      priority: interactive
      project: "{{ env_var('PROJECT_ID') }}"
      threads: 8
      type: bigquery
      dataproc_region: europe-west1
      gcs_bucket: "{{ env_var('PYTHON_DBT_MODELS_BUCKET') }}"
      dataproc_batch:
        environment_config:
          execution_config:
            service_account: "{{ env_var('SERVICE_ACCOUNT') }}"
            subnetwork_uri: "{{ env_var('SUBNET') }}"
            staging_bucket: "{{ env_var('PYTHON_DBT_STAGING_BUCKET') }}"
        pyspark_batch:
          jar_file_uris:
            [
              "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.13-0.29.0.jar",
            ]
        runtime_config:
          container_image: "europe-docker.pkg.dev/{{ env_var('PYTHON_DBT_CONTAINER') }}"

and then run any Python model with dbt run --select python_model

Relevant log output

No response

Environment

- OS: Linux version 5.15.49-linuxkit (root@buildkitsandbox) (gcc (Alpine 10.2.1_pre1) 10.2.1 20201203, GNU ld (GNU Binutils) 2.35.2) #1 SMP PREEMPT Tue Sep 13 07:51:32 UTC 2022
- Python: 3.10
- dbt-core (working version): 1.5.0
- dbt-bigquery (working version): 1.5.1
- dbt-core (regression version): 1.5.1
- dbt-bigquery (regression version): 1.5.1

Additional Context

No response

@nickozilla nickozilla added bug Something isn't working regression triage labels May 31, 2023
@github-actions github-actions bot changed the title [Regression] Profile no longer accepts staging_bucket in execution_config [ADAP-592] [Regression] Profile no longer accepts staging_bucket in execution_config May 31, 2023
@nickozilla nickozilla changed the title [ADAP-592] [Regression] Profile no longer accepts staging_bucket in execution_config [Regression] Profile no longer accepts staging_bucket param in execution_config May 31, 2023
@nickozilla (Contributor, Author)

Hi @dataders, have you managed to test/replicate this bug?

@jtcohen6 (Contributor)

Hey @nickozilla! In which previous version was this working for you, such that it stopped working in v1.5.1?

I'm not sure exactly what's up here, so I'm just going to provide some context (which may already be obvious to you) & share some educated guesses.

As far as I can tell, all we're doing is applying whatever config the user (you) passes in, and trying to validate it with google.protobuf.json_format.ParseDict before shipping it off to Dataproc:

    # Apply configuration from dataproc_batch key, possibly overriding defaults.
    if self.credential.dataproc_batch:
        self._update_batch_from_config(self.credential.dataproc_batch, batch)
    return batch

@classmethod
def _update_batch_from_config(
    cls, config_dict: Union[Dict, DataprocBatchConfig], target: dataproc_v1.Batch
):
    try:
        # updates in place
        ParseDict(config_dict, target._pb)
    except Exception as e:
        docurl = (
            "https://cloud.google.com/dataproc-serverless/docs/reference/rpc/google.cloud.dataproc.v1"
            "#google.cloud.dataproc.v1.Batch"
        )
        raise ValueError(
            f"Unable to parse dataproc_batch as valid batch specification. See {docurl}. {str(e)}"
        ) from e

This was implemented in #578, and included in v1.5.0. In older versions (v1.3 + v1.4), I don't think we allowed user configuration for these properties at all. I don't think we've made any changes since.

I agree that, based on the docs, it seems like the config you're supplying matches the expected schema.


Could you try running this code apart from dbt, replacing the env_var calls with your actual values?

import yaml

from google.cloud import dataproc_v1
from google.protobuf.json_format import ParseDict

batch = dataproc_v1.Batch(
    {
        "runtime_config": dataproc_v1.RuntimeConfig(
            version="1.1",
            properties={
                "spark.executor.instances": "2",
            },
        )
    }
)

your_config = """
environment_config:
  execution_config:
    service_account: "{{ env_var('SERVICE_ACCOUNT') }}"
    subnetwork_uri: "{{ env_var('SUBNET') }}"
    staging_bucket: "{{ env_var('PYTHON_DBT_STAGING_BUCKET') }}"
pyspark_batch:
  jar_file_uris:
    [
      "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.13-0.29.0.jar",
    ]
runtime_config:
  container_image: "europe-docker.pkg.dev/{{ env_var('PYTHON_DBT_CONTAINER') }}"
"""

config_dict = yaml.safe_load(your_config)
ParseDict(config_dict, batch._pb)

That ParseDict is successful for me when I run it as is...


There have been several new releases of protobuf in the past few months, though none of google-cloud-dataproc since March. Could you also run pip freeze and include the versions of those packages?
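As a quick way to capture just the packages in question (rather than scanning the full pip freeze output), something like this stdlib-only sketch would work; it assumes nothing beyond Python 3.8+ and simply reads installed-package metadata:

```python
# Collect the versions of the packages relevant to this issue.
# Stdlib only; importlib.metadata is available on Python 3.8+.
from importlib.metadata import version, PackageNotFoundError

report = {}
for pkg in ("protobuf", "google-cloud-dataproc", "dbt-core", "dbt-bigquery"):
    try:
        report[pkg] = version(pkg)
    except PackageNotFoundError:
        report[pkg] = "not installed"

for pkg, ver in report.items():
    print(f"{pkg}: {ver}")
```

Pasting that output into the issue would show whether the environment hitting the error has an older google-cloud-dataproc or protobuf than the one where ParseDict succeeds.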

@jtcohen6 jtcohen6 removed the triage label Jun 27, 2023
@nickozilla (Contributor, Author)

Closing this issue now, as we haven't seen it for a while and cannot reproduce it in our current environment. Our versions are now:

  • dbt-bigquery==1.6.5
  • dbt-core==1.6.5

Thanks for looking into this at the time @jtcohen6

@nickozilla nickozilla closed this as not planned Oct 16, 2023