[Bug]: Dataflow: SDK harness disconnected errors #25273
Comments
The beam/sdks/python/apache_beam/transforms/core.py Line 1441 in 64e40d2
In some cases there may be logs in worker-startup or worker logger preceding the crash.
If you'd like someone to have a closer look at the pipeline or logs, you can reach out to Dataflow customer support.
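
For anyone trying to find those preceding logs, here is a minimal sketch that builds a Cloud Logging filter for one job. It assumes the worker-startup and worker logs are published under the dataflow.googleapis.com/ log ID prefix and that the job's log entries use the dataflow_step resource type; the helper name dataflow_log_filter and the job ID in the example are hypothetical placeholders, not something taken from this issue.

```python
from typing import Sequence


def dataflow_log_filter(
    job_id: str,
    log_names: Sequence[str] = ("worker-startup", "worker"),
) -> str:
    """Return a Cloud Logging filter string for the given Dataflow job.

    Assumption: the logs live under the "dataflow.googleapis.com/" prefix.
    """
    # One log_id(...) clause per log we want to inspect.
    name_clause = " OR ".join(
        f'log_id("dataflow.googleapis.com/{name}")' for name in log_names
    )
    return (
        'resource.type="dataflow_step" '
        f'AND resource.labels.job_id="{job_id}" '
        f"AND ({name_clause})"
    )


# Hypothetical job ID, for illustration only.
print(dataflow_log_filter("my-job-id"))
```

The resulting string can be pasted into the Logs Explorer query box or passed to gcloud logging read, then narrowed to the minutes just before the disconnect message.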
Hello folks, has anyone here figured out the root cause?

Closing this since there is not enough actionable information on this ticket.
It would be better to open a Google Cloud support ticket so the team has more details to help debug.
What happened?
We have a pipeline that extracts embeddings (feature vectors) from images stored in a Cloud Storage bucket and inserts them into a BigQuery table.
We're consistently getting "SDK harness sdk-0-1 disconnected." errors when the Dataflow job runs on N1 type VM instances.
Notes
N2 machines work fine but N1 fails, which is somewhat surprising because N1 is the Google-default machine type.
Jobs run slower on N1 machines and sometimes appear to fail due to these errors.
Using a larger VM (more memory, CPU and disk) didn't resolve the errors.
We also have another pipeline that extracts embeddings from text using the lapse model, which has the same errors on both N1 and N2 machines.
Diagnostics tab: No errors found during this interval.
We're creating Dataflow job templates (Apache Beam 2.40, Python), storing them on Cloud Storage and using the API to launch new jobs.
We're batching the items before giving them to the stage where embeddings are extracted. Reducing batch size didn't matter.
Pipeline option sdk_worker_parallelism changed from 0 (default) to 1 and didn't change anything.
Auto-scaling disabled (max_worker=1) and same errors.
Reshuffle stage removed from the pipeline: we still get "SDK harness sdk-0-0 disconnected." but no data channel errors, e.g. "The Data channel closed, unable to send additional data to SDK sdk-0-3".
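
To make the N1 versus N2 comparison easier to reproduce, the settings from the notes above could be captured as command-line style flags. This is only a sketch under assumptions: the machine sizes n1-standard-4 and n2-standard-4 are placeholders (the exact sizes are not stated here), the helper dataflow_flags is hypothetical, and max_worker=1 is assumed to refer to Beam's max_num_workers option.

```python
from typing import List


def dataflow_flags(
    machine_type: str,
    max_num_workers: int = 1,
    sdk_worker_parallelism: int = 1,
) -> List[str]:
    """Build the flag list for one experiment run (hypothetical helper)."""
    return [
        "--runner=DataflowRunner",
        f"--machine_type={machine_type}",
        # Assumption: "max_worker=1" in the notes means max_num_workers.
        f"--max_num_workers={max_num_workers}",
        f"--sdk_worker_parallelism={sdk_worker_parallelism}",
    ]


# Placeholder machine sizes; only the machine family differs between runs.
n1_flags = dataflow_flags("n1-standard-4")
n2_flags = dataflow_flags("n2-standard-4")
print(n1_flags)
print(n2_flags)

# With Apache Beam installed, a list like this can be handed to the SDK:
#   from apache_beam.options.pipeline_options import PipelineOptions
#   options = PipelineOptions(n1_flags)
```

Keeping every flag except the machine type identical between the two runs makes it easier to attribute any difference in behavior to the VM family.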
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components