[Bug]: Dataflow: SDK harness disconnected errors #25273

Closed
2 of 15 tasks
buraktokman opened this issue Feb 2, 2023 · 7 comments
Labels
bug dataflow done & done Issue has been reviewed after it was closed for verification, followups, etc. P2 python

Comments

@buraktokman

What happened?

We have a pipeline to extract embeddings (feature vectors) from images stored in Cloud Storage bucket and insert into a BigQuery table.

We're consistently getting SDK harness sdk-0-1 disconnected. errors when the Dataflow job runs on N1 type VM instances.

Error message from worker: 
Data channel closed, unable to send additional data to SDK sdk-0-3
SDK harness sdk-0-1 disconnected.
SDK harness sdk-0-2 disconnected.
SDK harness sdk-0-0 disconnected.
Data channel closed, unable to receive additional data from SDK sdk-0-3
SDK harness sdk-0-1 disconnected.
SDK harness sdk-0-2 disconnected.
Data channel closed, unable to receive additional data from SDK sdk-0-1

Notes

N2 machines work fine, but N1 machines fail, which is somewhat surprising because N1 is the Google default machine type.

  • Jobs run slower on N1 machines and sometimes appear to fail due to these errors.

  • Using a larger VM (more memory, CPU and disk) didn't resolve the errors.

  • We also have another pipeline that extracts embeddings from text using the lapse model; it shows the same errors on both N1 and N2 machines.

  • Diagnostics tab: No errors found during this interval.

We create Dataflow job templates (Apache Beam 2.40, Python), store them in Cloud Storage, and launch new jobs through the API.

  • We batch the items before passing them to the stage where embeddings are extracted. Reducing the batch size made no difference.

  • Changing the pipeline option sdk_worker_parallelism from 0 (default) to 1 made no difference.

  • Disabling auto-scaling (max_worker=1) produced the same errors.

  • Removed the Reshuffle stage from the pipeline.

    • The disconnect errors remain (e.g. SDK harness sdk-0-0 disconnected.),
      but the data channel errors (e.g. Data channel closed, unable to send additional data to SDK sdk-0-3) no longer appear.
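
For readers who want to repeat the N1 versus N2 comparison, the settings listed above can be written out as explicit pipeline flags. This is only a sketch: the helper name `dataflow_flags` and the machine sizes `n1-standard-4` and `n2-standard-4` are assumptions, and `max_worker=1` above is assumed to mean the `--max_num_workers` option.

```python
def dataflow_flags(machine_type):
    """Build command-line style pipeline flags for one test run.

    machine_type is an assumption; the issue only says "N1" and "N2".
    """
    return [
        "--runner=DataflowRunner",
        f"--machine_type={machine_type}",
        "--max_num_workers=1",         # assumed meaning of max_worker=1
        "--sdk_worker_parallelism=1",  # value tried in the issue
    ]


# Two runs that differ only in the machine family.
n1_flags = dataflow_flags("n1-standard-4")
n2_flags = dataflow_flags("n2-standard-4")
```

Keeping every flag except the machine type identical makes it easier to attribute a difference in behavior to the machine family rather than to another setting.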

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@buraktokman buraktokman changed the title [Bug]: [Bug]: Dataflow: SDK harness disconnected errors Feb 2, 2023
@tvalentyn
Contributor

The SDK harness sdk-0-0 disconnected error means that something caused the SDK harness process to crash. This is the process that runs the pipeline's user code, and it is where the bulk of the processing happens. The investigation should focus on identifying what causes the crash.
It can be an OOM event, a crash in a C extension or third-party library, or something else. If processing a particular element causes the process to crash, such elements could potentially be filtered out by using .with_exception_handling(use_subprocess=True), see:

use_subprocess=False,
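
The idea behind running user code in a subprocess can be illustrated without Beam. The following is a minimal standard-library sketch, not Beam's implementation: each element is processed in a child interpreter, and an element that kills the child is routed to a "bad" list instead of taking down the parent. The names `CHILD_CODE` and `process_isolated`, the element `"bad"`, and the exit code 139 are all made up for illustration.

```python
import subprocess
import sys

# Hypothetical per-element work, run in a child interpreter. The element
# "bad" simulates an abrupt process death (for example a crash inside a
# C extension) by exiting without any cleanup.
CHILD_CODE = """
import os, sys
element = sys.argv[1]
if element == "bad":
    os._exit(139)
print(element.upper())
"""


def process_isolated(elements):
    """Process each element in its own subprocess; split into good and bad."""
    good, bad = [], []
    for element in elements:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_CODE, element],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            good.append(result.stdout.strip())
        else:
            # The child died; record the element instead of crashing here.
            bad.append((element, result.returncode))
    return good, bad


good, bad = process_isolated(["a", "bad", "b"])
```

Inspecting the "bad" output is one way to find out whether a small set of inputs is responsible for the crashes. Starting one process per element is slow, so this sketch is suited to diagnosis rather than production use.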

@tvalentyn
Contributor

In some cases there may be logs in the worker-startup or worker logger preceding the crash.
It's difficult to determine the exact root cause without more information about the pipeline or a reproducible example.

@tvalentyn
Contributor

If you'd like someone to have a closer look at the pipeline or logs, you can reach out to Dataflow customer support.

@viniciusdsmello

Hello folks, has anyone here figured out the root cause?

@tvalentyn
Contributor

SDK harness sdk-0-1 disconnected messages are a symptom, not a root cause. One needs to look at other logs preceding the disconnection event.
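
One way to act on this advice is to export the worker logs and pull out the entries that come shortly before the first disconnect message. The sketch below assumes the logs are already available as `(timestamp, message)` pairs; that shape, the helper name `entries_before_disconnect`, and the sample messages marked "hypothetical" are assumptions, not real Dataflow output.

```python
from datetime import datetime, timedelta


def entries_before_disconnect(entries, window_seconds=60):
    """Return log entries in the window just before the first disconnect."""
    disconnect_time = next(
        (ts for ts, msg in entries if "disconnected" in msg), None
    )
    if disconnect_time is None:
        return []
    start = disconnect_time - timedelta(seconds=window_seconds)
    return [(ts, msg) for ts, msg in entries if start <= ts < disconnect_time]


# Made-up sample data; only the disconnect message text comes from the issue.
t0 = datetime(2023, 2, 2, 12, 0, 0)
entries = [
    (t0, "hypothetical: worker started"),
    (t0 + timedelta(seconds=100), "hypothetical: memory usage high"),
    (t0 + timedelta(seconds=130), "SDK harness sdk-0-1 disconnected."),
    (t0 + timedelta(seconds=140), "hypothetical: harness restarting"),
]
suspects = entries_before_disconnect(entries, window_seconds=60)
```

The window length is arbitrary; widening it may be necessary if the underlying problem, such as growing memory use, builds up slowly.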

@tvalentyn
Contributor

Closing this since there is not enough actionable information on this ticket.

@github-actions github-actions bot added this to the 2.52.0 Release milestone Oct 3, 2023
@liferoad
Collaborator

liferoad commented Oct 3, 2023

It would be better to open a Google Cloud support ticket so the team has more details to help debug.

@jrmccluskey jrmccluskey added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label Oct 17, 2023
5 participants