[Bug]: Python JDBC IO Try To Connect RDB Before Deploying #23029
Comments
Could you please share the error message seen when deploying the pipeline to Dataflow? I did some local tests and saw the following error when the JDBC database could not be reached:
If this is also what you see, then what happens is that the external transform tries to infer the schema by connecting to the database at pipeline expansion time, which happens only in the external transform expansion service. I will investigate whether and how this can be avoided.
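For context, a minimal sketch of the kind of pipeline that hits this, with hypothetical connection details (the JDBC URL, credentials, table name, and Dataflow options are placeholders). Applying `ReadFromJdbc` triggers external transform expansion on the machine building the pipeline, and the expansion service opens a JDBC connection to infer the schema before anything reaches Dataflow:

```python
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder Dataflow options for illustration.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
)

with beam.Pipeline(options=options) as p:
    # Expansion (and thus the JDBC connection for schema inference)
    # happens here, on the submitting machine, not on the Dataflow workers.
    rows = p | 'ReadFromJdbc' >> ReadFromJdbc(
        table_name='example_table',
        driver_class_name='org.postgresql.Driver',
        jdbc_url='jdbc:postgresql://db.internal:5432/mydb',
        username='user',
        password='secret',
    )
```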
.remove-labels "awaiting triage"
Label "awaiting cannot be managed because it does not exist in the repo. Please check your spelling.
.remove-labels 'awaiting triage'
The expansion service tries to get the schema by connecting to the JDBC server. The Java SDK does not go through the expansion service, so it did not hit this. However, I agree that it would be reasonable to defer this process. CC: @robertwb There was some discussion of deferring the expansion service; it could benefit this use case if implemented. Or is there another solution?
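One possibility along those lines, sketched under assumptions rather than tested: `ReadFromJdbc` accepts an `expansion_service` argument, so the expansion (and hence the schema-inferring connection) could in principle be pointed at an expansion service running on a host that can reach the database. The address below is hypothetical:

```python
# Hypothetical address of an expansion service started on a host that
# can reach the database (e.g. a VM inside the same network as the DB).
rows = p | 'ReadFromJdbc' >> ReadFromJdbc(
    table_name='example_table',
    driver_class_name='org.postgresql.Driver',
    jdbc_url='jdbc:postgresql://db.internal:5432/mydb',
    username='user',
    password='secret',
    expansion_service='expansion-host.internal:8097',
)
```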
I'm facing the same issue, where it tries to infer the schema during pipeline submission from the local machine (which doesn't have access to the DB server).
Hi @case-k-git, any updates? Did you find a solution?
Hi Matar,
Hey @RhysGrimshaw, as of now the only option is to open the connection between the machine submitting the job (your local machine if you submit manually, or your Dataflow VMs' subnet) and the DB server. For example, in my case I opened the connection from our Dataflow subnet to the DB server.
Thank you for getting back to me. How do you go about running a Python script directly from the subnet? The only options I seem to have available are a template/builder or reusing a Dataflow job, but since my job fails before it gets to Dataflow (due to it trying to infer the schema over a local machine connection), I can't build a new job from this.
@RhysGrimshaw for testing purposes, I opened the connection from my local machine to the DB server. Production-wise, we submit our jobs automatically via Cloud Scheduler & Cloud Composer. So the point to remember is that your Dataflow VM needs to be able to access the DB server for the initial schema inference.
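A quick pre-flight check along these lines, as a minimal sketch with hypothetical host and port values: verify that the submitting machine can reach the database before launching the job, since the schema inference connects from there at expansion time.

```python
import socket

# Hypothetical database host and port; adjust for your setup.
DB_HOST, DB_PORT = 'db.internal', 5432

# Fail fast if the machine submitting the pipeline cannot reach the
# database, since schema inference connects from here at expansion time.
try:
    with socket.create_connection((DB_HOST, DB_PORT), timeout=5):
        print(f'{DB_HOST}:{DB_PORT} is reachable; safe to submit.')
except OSError as exc:
    raise SystemExit(f'Cannot reach {DB_HOST}:{DB_PORT}: {exc}')
```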
What happened?
When I tried to deploy a Python JDBC pipeline to Dataflow from my local environment, the deployment failed with a connection error. The Python JDBC IO seems to try to connect to the database from the local environment, not only from the Dataflow environment.
I checked the connection and found that the attempt was being made from my PC. The database only accepts connections from inside the Dataflow network, so I got a connection error.
I also checked the Java JDBC version, and it worked fine, so this behavior in the Python version must be a bug.
Issue Priority
Priority: 2
Issue Component
Component: cross-language