DataflowPipelineJob.waitUntilFinish() crashes when it has created a template. #20106

Open
damccorm opened this issue Jun 4, 2022 · 3 comments

Comments

@damccorm
Contributor

damccorm commented Jun 4, 2022


INFO: Template successfully created.

Exception in thread "main" java.lang.UnsupportedOperationException: The result of template creation should not be used.
        at org.apache.beam.runners.dataflow.util.DataflowTemplateJob.getJobId(DataflowTemplateJob.java:37)
        at org.apache.beam.runners.dataflow.DataflowPipelineJob.getJobWithRetries(DataflowPipelineJob.java:524)
        at org.apache.beam.runners.dataflow.DataflowPipelineJob.getStateWithRetries(DataflowPipelineJob.java:506)
        at org.apache.beam.runners.dataflow.DataflowPipelineJob.waitUntilFinish(DataflowPipelineJob.java:295)
        at org.apache.beam.runners.dataflow.DataflowPipelineJob.waitUntilFinish(DataflowPipelineJob.java:224)
        at org.apache.beam.runners.dataflow.DataflowPipelineJob.waitUntilFinish(DataflowPipelineJob.java:183)
        at org.apache.beam.runners.dataflow.DataflowPipelineJob.waitUntilFinish(DataflowPipelineJob.java:176)

This is a real error. If a template was created, the job is complete. Instead of crashing while trying to access the job ID, as though DataflowPipelineJob did not know it had created a template, it should return successfully. Or perhaps there is another design choice, but simply crashing does not make sense. Probably DataflowRunner should not return a DataflowPipelineJob at all in this case.
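To make the proposal concrete, here is a minimal sketch of the two behaviors side by side. Everything in it (`JobState`, `JobResult`, `SubmittedJob`, `ThrowingTemplateResult`, `CompletedTemplateResult`) is a hypothetical stand-in written for illustration; none of it is Beam code, and it is not a claim about how a fix should actually be implemented in DataflowRunner.

```java
// Hypothetical stand-ins for illustration only; these are NOT Beam classes.
enum JobState { RUNNING, DONE, FAILED }

interface JobResult {
    JobState waitUntilFinish();
}

/** Models a normally submitted job that has a job ID and can be polled. */
final class SubmittedJob implements JobResult {
    private final String jobId;

    SubmittedJob(String jobId) {
        this.jobId = jobId;
    }

    String jobId() {
        return jobId;
    }

    @Override
    public JobState waitUntilFinish() {
        // A real runner would poll the service using the job ID;
        // this sketch simply reports success.
        return JobState.DONE;
    }
}

/** Models the behavior in the stack trace: using the result throws. */
final class ThrowingTemplateResult implements JobResult {
    @Override
    public JobState waitUntilFinish() {
        throw new UnsupportedOperationException(
            "The result of template creation should not be used.");
    }
}

/** Models the proposed behavior: template creation is already complete. */
final class CompletedTemplateResult implements JobResult {
    @Override
    public JobState waitUntilFinish() {
        // There is no job to wait for, so report a terminal state immediately.
        return JobState.DONE;
    }
}
```

With the second variant, caller code that always calls `waitUntilFinish()` would not need to know whether the run produced a template or a running job.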

Imported from Jira BEAM-9337. Original Jira may contain additional context.
Reported by: kenn.

@CodingAnarchy

We recently ran into this error when trying to do some post-batch cleanup in some of our GCP Dataflow jobs that use a template, and had to revert those changes. Not being able to use waitUntilFinish() to defer cleanup of intermediate resources and do post-pipeline work makes these pipelines more difficult to manage.

@djaneluz

Just faced the same problem with Beam version 2.51.0. Any updates?

@RonBarkan

With this behavior, it is not possible to run a pipeline and then, once it has finished, run another pipeline or even just some non-pipeline code.
Why this is an important and valid use case:

  1. Many pipelines end by writing output somewhere as a terminal step, so you cannot chain another step after it. If you want to do something after all output has completed, such as writing a marker file, you must do it after the pipeline ends.
    You can only do this if you call waitUntilFinish().
  2. If you collect metrics from the run to store or process them, you probably want to:
    var result = pipeline.run();
    result.waitUntilFinish();
    processMetrics(result.metrics());
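
The marker-file case from item 1 could look roughly like the sketch below. `RunState`, `RunResult`, and `PostRunHook` are hypothetical stand-ins rather than Beam types, and the marker contents are an arbitrary choice; the point is only that the post-pipeline step depends on `waitUntilFinish()` returning a terminal state instead of throwing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-ins for illustration only; these are NOT Beam types.
enum RunState { DONE, FAILED }

interface RunResult {
    RunState waitUntilFinish();
}

final class PostRunHook {
    private PostRunHook() {}

    /**
     * Blocks until the run reaches a terminal state, then writes a marker
     * file only if the run succeeded. Returns true if the marker was written.
     */
    static boolean writeMarkerWhenDone(RunResult result, Path marker) throws IOException {
        RunState state = result.waitUntilFinish();
        if (state != RunState.DONE) {
            // Leave no marker behind for a failed run.
            return false;
        }
        Files.write(marker, "SUCCESS\n".getBytes(StandardCharsets.UTF_8));
        return true;
    }
}
```

If `waitUntilFinish()` throws, as it does in the template case reported above, a hook like this never gets the chance to run.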
