You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I created a simple program which reads from TextIO and writes to TextIO. If the input is a local file this works fine. However if you use a file on GCS the prism runner appears to get stuck trying to read the input over and over again.
Here's an example message it keeps printing out
2024/02/19 15:17:19 INFO Reading from gs://<BUCKET>/hackedlogs.json source=/Users/jlewi/go/pkg/mod/github.com/apache/beam/sdks/[email protected]/go/pkg/beam/io/textio/textio.go:226 time=2024-02-19T23:17:19.971Z worker.ID=job-001[go-job-1-1708384631797458000]_go worker.endpoint=localhost:65326
This is using the GoLang SDK v2.54.0.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
Component: Python SDK
Component: Java SDK
Component: Go SDK
Component: Typescript SDK
Component: IO connector
Component: Beam YAML
Component: Beam examples
Component: Beam playground
Component: Beam katas
Component: Website
Component: Spark Runner
Component: Flink Runner
Component: Samza Runner
Component: Twister2 Runner
Component: Hazelcast Jet Runner
Component: Google Cloud Dataflow Runner
The text was updated successfully, but these errors were encountered:
This problem seems to have happened starting from 2.50 (the Prism rollout) and is the same all the way until 2.54 (I tried all versions down to 2.49 to see it working)
It shouldn't matter to prism that the file is in GCS. I could see perhaps an issue when running docker workers, but prism usually runs Go SDK pipelines in loopback mode.
Most likely the additional latency to GCS is causing splits to go haywire.
We did have #29968 which reduced the split aggression, but that made it into 2.54.0, so something else is at play (or the initial latency hit is very significant).
This pipeline is roughly equivalent to the Wordcount example ( beam/sdks/go/examples/wordcount/wordcount.go) which does work.
But the repro pipeline instead has an anonymous function as a DoFn. Those don't work on portable runners, and are not meaningfully supportable. If the original tests were on the Go Direct Runner (which would be the case in 2.54.0), then they'd have worked, but only because the Direct runner doesn't serialize anything.
I suspect that's the root cause of the issue in this case, and this is validated once I move the inlined function to a registered, named function call like the following snippet does:
Unfortunately for us, it's not presently possible in Go to detect if a function is an anonymous function or not, outside of doing static analysis on the source code. Were I to redesign the SDK, I wouldn't permit simple functions like that as DoFn, as they cause more issues than they solve, or I'd provide a more closure/anonymous safe way of building pipelines to enable them, and avoid registrations entirely.
What happened?
There's a full reproduction here:
https://github.com/jlewi/beambugs/tree/main/prismgcs
I created a simple program which reads from TextIO and writes to TextIO. If the input is a local file this works fine. However if you use a file on GCS the prism runner appears to get stuck trying to read the input over and over again.
Here's an example message it keeps printing out
This is using the GoLang SDK v2.54.0.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
The text was updated successfully, but these errors were encountered: