The PreCommit GoPortable job is flaky #32324

Closed · github-actions bot opened this issue Aug 27, 2024 · 8 comments · Fixed by #32449

Comments

@github-actions (Contributor)

The PreCommit GoPortable job is failing over 50% of the time.
Please visit https://github.com/apache/beam/actions/workflows/beam_PreCommit_GoPortable.yml?query=is%3Afailure+branch%3Amaster to see all failed workflow runs.
See also Grafana statistics: http://metrics.beam.apache.org/d/CTYdoxP4z/ga-post-commits-status?orgId=1&viewPanel=10&var-Workflow=PreCommit%20GoPortable

@tvalentyn (Contributor)

Some tests pass, some fail:

=== RUN   TestDirectParDo	
2024/08/27 13:27:42 components:{transforms:{key:"e1"  value:{unique_name:"Impulse"  spec:{urn:"beam:transform:impulse:v1"}  outputs:{key:"i0"  value:"n1"}}}  transforms:{key:"e2"  value:{unique_name:"beam.createFn"  spec:{urn:"beam:transform:pardo:v1"  payload:"\n\x86\x02\n\x19beam:go:transform:dofn:v1\x1a\xe8\x01ChliZWFtOmdvOnRyYW5zZm9ybTpkb2ZuOnYxEo8BCm4SOwgYEjcIGkozZ2l0aHViLmNvbS9hcGFjaGUvYmVhbS9zZGtzL3YyL2dvL3BrZy9iZWFtLmNyZWF0ZUZuGi97InZhbHVlcyI6WyJBUUk9IiwiQVFRPSIsIkFRWT0iXSwidHlwZSI6IkNBST0ifRIMCAESCAoGCBQSAggIGggKBgoECBlADyIFUGFyRG8="}  inputs:{key:"i0"  value:"n1"}  outputs:{key:"i0"  value:"n2"}  environment_id:"go"}}  transforms:{key:"e3"  value:{unique_name:"regression.directFn"  spec:{urn:"beam:transform:pardo:v1"  payload:"\n\xce\x01\n\x19beam:go:transform:dofn:v1\x1a\xb0\x01ChliZWFtOmdvOnRyYW5zZm9ybTpkb2ZuOnYxEmUKSgpICjpnaXRodWIuY29tL2FwYWNoZS9iZWFtL3Nka3MvdjIvZ28vdGVzdC9yZWdyZXNzaW9uLmRpcmVjdEZuEgoIFiICCAIqAggCEggIARIECgIIAhoGCgQKAggCIgVQYXJEbw=="}  inputs:{key:"i0"  value:"n2"}  outputs:{key:"i0"  value:"n3"}  environment_id:"go"}}  transforms:{key:"e4"  value:{unique_name:"passert.Sum(direct)/passert.NonEmpty/Impulse"  spec:{urn:"beam:transform:impulse:v1"}  outputs:{key:"i0"  value:"n4"}}}  transforms:{key:"e5"  value:{unique_name:"passert.Sum(direct)/passert.NonEmpty/passert.nonEmptyFn"  spec:{urn:"beam:transform:pardo:v1"  payload:"\n\xe2\x01\n\x19beam:go:transform:dofn:v1\x1a\xc4\x01ChliZWFtOmdvOnRyYW5zZm9ybTpkb2ZuOnYxEnYKUxJNCBgSSQgaSkVnaXRodWIuY29tL2FwYWNoZS9iZWFtL3Nka3MvdjIvZ28vcGtnL2JlYW0vdGVzdGluZy9wYXNzZXJ0Lm5vbkVtcHR5Rm4aAnt9EgwIARIICgYIFBICCAgSCggGEgYKBAgZQBUiBVBhckRv\x1aO\n\x02i1\x12I\n\x1d\n\x1bbeam:side_input:iterable:v1\x12\x05\n\x03foo\x1a!\n\x1fbeam:go:windowmapping:global:v1"}  inputs:{key:"i0"  value:"n4"}  inputs:{key:"i1"  value:"n3"}  environment_id:"go"}}  transforms:{key:"e6"  value:{unique_name:"passert.Sum(direct)/beam.addFixedKeyFn"  spec:{urn:"beam:transform:pardo:v1"  payload:"\n\xee\x01\n\x19beam:go:transform:dofn:v1\x1a\xd0\x01ChliZWFtOmdvOnRyYW5zZm9ybTpkb2ZuOnYxEn0KUApOCjhnaXRodWIuY29tL2FwYWNoZS9iZWFtL3Nka3MvdjIvZ28vcGtnL2JlYW0uYWRkRml4ZWRLZXlGbhISCBYiBAgZQA8qAggCKgQIGUAPEgoIARIGCgQIGUAPGhYKFAoECBlACxIECgIIAhIGCgQIGUAPIgVQYXJEbw=="}  inputs:{key:"i0"  value:"n3"}  outputs:{key:"i0"  value:"n5"}  environment_id:"go"}}  transforms:{key:"e7"  value:{unique_name:"passert.Sum(direct)/CoGBK"  spec:{urn:"beam:transform:group_by_key:v1"}  inputs:{key:"i0"  value:"n5"}  outputs:{key:"i0"  value:"n6"}}}  transforms:{key:"e8"  value:{unique_name:"passert.Sum(direct)/passert.sumFn"  spec:{urn:"beam:transform:pardo:v1"  payload:"\n\x86\x02\n\x19beam:go:transform:dofn:v1\x1a\xe8\x01ChliZWFtOmdvOnRyYW5zZm9ybTpkb2ZuOnYxEo8BCm4SSAgYEkQIGkpAZ2l0aHViLmNvbS9hcGFjaGUvYmVhbS9zZGtzL3YyL2dvL3BrZy9iZWFtL3Rlc3RpbmcvcGFzc2VydC5zdW1GbhoieyJuYW1lIjoiZGlyZWN0Iiwic2l6ZSI6Mywic3VtIjo5fRIWCAESEgoECBlADRIECgIIAhIECgIIAiIFUGFyRG8="}  inputs:{key:"i0"  value:"n6"}  environment_id:"go"}}  transforms:{key:"s1"  value:{unique_name:"passert.Sum(direct)"  subtransforms:"s2"  subtransforms:"e6"  subtransforms:"e7"  subtransforms:"e8"  inputs:{key:"n3"  value:"n3"}  environment_id:"go"}}  transforms:{key:"s2"  value:{unique_name:"passert.Sum(direct)/passert.NonEmpty"  subtransforms:"e4"  subtransforms:"e5"  inputs:{key:"n3"  value:"n3"}  environment_id:"go"}}  pcollections:{key:"n1"  value:{unique_name:"n1"  coder_id:"c0"  is_bounded:BOUNDED  windowing_strategy_id:"w0"}}  pcollections:{key:"n2"  value:{unique_name:"n2"  coder_id:"c3"  is_bounded:BOUNDED  
windowing_strategy_id:"w0"}}  pcollections:{key:"n3"  value:{unique_name:"n3"  coder_id:"c3"  is_bounded:BOUNDED  windowing_strategy_id:"w0"}}  pcollections:{key:"n4"  value:{unique_name:"n4"  coder_id:"c0"  is_bounded:BOUNDED  windowing_strategy_id:"w0"}}  pcollections:{key:"n5"  value:{unique_name:"n5"  coder_id:"c4"  is_bounded:BOUNDED  windowing_strategy_id:"w0"}}  pcollections:{key:"n6"  value:{unique_name:"n6"  coder_id:"c6"  is_bounded:BOUNDED  windowing_strategy_id:"w0"}}  windowing_strategies:{key:"w0"  value:{window_fn:{urn:"beam:window_fn:global_windows:v1"}  merge_status:NON_MERGING  window_coder_id:"c1"  trigger:{default:{}}  accumulation_mode:DISCARDING  output_time:END_OF_WINDOW  closing_behavior:EMIT_IF_NONEMPTY  on_time_behavior:FIRE_IF_NONEMPTY  environment_id:"go"}}  coders:{key:"c0"  value:{spec:{urn:"beam:coder:bytes:v1"}}}  coders:{key:"c1"  value:{spec:{urn:"beam:coder:global_window:v1"}}}  coders:{key:"c2"  value:{spec:{urn:"beam:go:coder:custom:v1"  payload:"Cgd2YXJpbnR6EgIIAhpdCklnaXRodWIuY29tL2FwYWNoZS9iZWFtL3Nka3MvdjIvZ28vcGtnL2JlYW0vY29yZS9ydW50aW1lL2NvZGVyeC5lbmNWYXJJbnRaEhAIFiIECBlADyoGCBQSAggIImkKSWdpdGh1Yi5jb20vYXBhY2hlL2JlYW0vc2Rrcy92Mi9nby9wa2cvYmVhbS9jb3JlL3J1bnRpbWUvY29kZXJ4LmRlY1ZhckludFoSHAgWIgQIGUADIgYIFBICCAgqBAgZQA8qBAgZQAE="}}}  coders:{key:"c3"  value:{spec:{urn:"beam:coder:length_prefix:v1"}  component_coder_ids:"c2"}}  coders:{key:"c4"  value:{spec:{urn:"beam:coder:kv:v1"}  component_coder_ids:"c3"  comp…	
2024/08/27 13:27:42 Cross-compiling /runner/_work/beam/beam/sdks/go/test/regression/pardo_test.go with GOOS=linux GOARCH=amd64 CGO_ENABLED=0 as /tmp/worker-8-1724765262974620701	
2024/08/27 13:27:46 Prepared job with id: go-testdirectpardo-459-937728f0-4be1-4d6b-aeaa-4925e4142ebe and staging token: go-testdirectpardo-459-937728f0-4be1-4d6b-aeaa-4925e4142ebe	
2024/08/27 13:27:46 Staged binary artifact with token: go-testdirectpardo-459-937728f0-4be1-4d6b-aeaa-4925e4142ebe	
2024/08/27 13:27:46 Submitted job: go-testdirectpardo-459-937728f0-4be1-4d6b-aeaa-4925e4142ebe	
2024/08/27 13:27:46 2024-08-27 13:27:46. (1): starting control server on port 35099	
2024/08/27 13:27:46 2024-08-27 13:27:46. (2): starting data server on port 35181	
2024/08/27 13:27:46 2024-08-27 13:27:46. (3): starting state server on port 37247	
2024/08/27 13:27:46 2024-08-27 13:27:46. (4): starting logging server on port 46831	
2024/08/27 13:27:46 Job[go-testdirectpardo-459-937728f0-4be1-4d6b-aeaa-4925e4142ebe] state: STOPPED	
2024/08/27 13:27:46 Job[go-testdirectpardo-459-937728f0-4be1-4d6b-aeaa-4925e4142ebe] state: STARTING	
2024/08/27 13:27:46 Job[go-testdirectpardo-459-937728f0-4be1-4d6b-aeaa-4925e4142ebe] state: RUNNING	
2024/08/27 13:27:46 2024-08-27 13:27:46. (5): Attempting to pull image apache/beam_go_sdk:dev	
2024/08/27 13:27:46 2024-08-27 13:27:46. (6): Unable to pull image apache/beam_go_sdk:dev, defaulting to local image if it exists	
2024/08/27 13:27:51 2024-08-27 13:27:51. (7): Waiting for docker to start up. Current status is exited	
2024/08/27 13:27:54 2024-08-27 13:27:54. (8): Traceback (most recent call last):	
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/local_job_service.py", line 294, in _run_job	
    self.result = self._invoke_runner()	
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/local_job_service.py", line 316, in _invoke_runner	
    return fn_runner.FnApiRunner(	
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 221, in run_via_runner_api	
    return self.run_stages(stage_context, stages)	
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 443, in run_stages	
    runner_execution_context.setup()	
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/execution.py", line 775, in setup	
    self._enqueue_stage_initial_inputs(stage)	
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/execution.py", line 804, in _enqueue_stage_initial_inputs	
    bundle_manager.data_api_service_descriptor())	
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/execution.py", line 1046, in data_api_service_descriptor	
    return self.worker_handlers[0].data_api_service_descriptor()	
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/execution.py", line 1038, in worker_handlers	
    self.execution_context.worker_handler_manager.get_worker_handlers(	
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/worker_handlers.py", line 930, in get_worker_handlers	
    worker_handler.start_worker()	
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/worker_handlers.py", line 807, in start_worker	
    raise RuntimeError(	
RuntimeError: SDK failed to start. Final status is exited
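
For context, the "Waiting for docker to start up. Current status is exited" line above suggests the runner polls the SDK container's status and raises this RuntimeError when the container exits instead of reaching a running state. Below is a minimal sketch of that pattern, assuming docker is on PATH; container_status and wait_for_worker are hypothetical helpers, not the actual worker_handlers code.

```python
# Hedged sketch of the startup check that produces
# "SDK failed to start. Final status is <status>".
import subprocess
import time


def container_status(container_id: str) -> str:
    """Return the Docker status string for a container, e.g. 'running' or 'exited'."""
    out = subprocess.run(
        ["docker", "inspect", "-f", "{{.State.Status}}", container_id],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()


def wait_for_worker(container_id: str, attempts: int = 5, delay_s: float = 1.0) -> None:
    """Poll the container until it is running; raise if it never gets there."""
    status = "unknown"
    for _ in range(attempts):
        status = container_status(container_id)
        if status == "running":
            return
        time.sleep(delay_s)
    raise RuntimeError(f"SDK failed to start. Final status is {status}")
```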

@tvalentyn (Contributor)

Tried to repro locally:

:beam$ gradlew :sdks:go:test:ulrValidatesRunner

The suite seems to be stuck after 2 hours, periodically emitting logs:

I0000 00:00:1724792615.011573  196526 fork_posix.cc:77] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1724792615.050536  196526 fork_posix.cc:77] Other threads are currently calling into gRPC, skipping fork() handlers
INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b'\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)\n\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)\n\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:509)\n\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)\n\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\nAug 27, 2024 7:24:03 PM org.apache.beam.fn.harness.FnHarness main\nINFO: Shutting SDK harness down.\n'
I0000 00:00:1724792621.584793  196526 fork_posix.cc:77] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1724792621.626991  196526 fork_posix.cc:77] Other threads are currently calling into gRPC, skipping fork() handlers
INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b'\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)\n\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)\n\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:509)\n\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)\n\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat org.apache.beam.vendor.grpc.v1p60p1.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\nAug 27, 2024 7:24:03 PM org.apache.beam.fn.harness.FnHarness main\nINFO: Shutting SDK harness down.\n'
<============-> 99% EXECUTING [1h 57m 42s]

@lostluck (Contributor)

It looks like it wasn't able to find the Go SDK container?

2024-08-23T13:25:39.8797158Z Error response from daemon: manifest for apache/beam_go_sdk:dev not found: manifest unknown: manifest unknown

Very weird.

That's not it though, since I see the same in the last successful run.

Both oddly still call the container without the docker.io domain, so I don't think that's it.

The output looks like the SDK container is starting up, and the SDK itself is starting up, but then it's unable to successfully connect to the runner?

control.Recv failed\n\tcaused by:\nrpc error: code = Unavailable desc = error reading from server: EOF\n'

This is perplexing. I can't find any smoking-gun change around when the failures started. That implies the issue is somehow outside of the repo, possibly related to our GHA images.

@tvalentyn (Contributor)

Both oddly still call the container without the docker.io domain, so I don't think that's it.

I think that's WAI (working as intended):

DefaultDockerImage = "apache/beam_go_sdk:" + SdkVersion

@tvalentyn (Contributor)

Tests fail inconsistently, but the entire workflow fails reliably. Occasionally, tests don't fail but get stuck and time out. Here is a sample of failures from past runs:

2024-08-24T13:22:36.8054401Z --- FAIL: TestWindowSums_Lifted (3.23s)
2024-08-24T13:22:36.9149859Z --- FAIL: TestSplittablePipeline (9.33s)
2024-08-24T13:22:42.8219296Z --- FAIL: TestLPErrorReshufflePipeline_DoFnSideInput (2.08s)

:golang$ cat old2.txt | grep "FAIL:"
2024-08-25T13:22:45.0164247Z --- FAIL: TestCoGBK (42.86s)
2024-08-25T13:22:45.5592314Z --- FAIL: TestXLang_CoGroupBy (14.81s)
2024-08-25T13:22:45.6498402Z --- FAIL: TestXLang_CombineGlobally (14.99s)
2024-08-25T13:22:45.6694315Z --- FAIL: TestXLang_Flatten (14.41s)
2024-08-25T13:22:50.0645128Z --- FAIL: TestLPErrorReshufflePipeline_DoFnSideInput (6.84s)

:golang$ cat logs.txt | grep "FAIL:"
2024-09-12T00:49:49.8715264Z --- FAIL: TestValueStateWindowed (7.19s)
2024-09-12T00:49:49.9010459Z --- FAIL: TestValueStateClear (6.43s)
2024-09-12T00:49:50.2949703Z --- FAIL: TestWordCount (88.29s)
2024-09-12T00:49:50.4352043Z --- FAIL: TestLPErrorReshufflePipeline_DoFnSideInput (7.30s)
2024-09-12T00:49:50.4664301Z --- FAIL: TestDirectParDo (8.64s)

:golang$ cat old3.txt | grep FAIL:
2024-09-06T01:48:12.1061249Z --- FAIL: TestWordCount (88.37s)
2024-09-06T01:48:12.3546391Z --- FAIL: TestXLang_Multi (14.78s)

:golang$ cat old4.txt | grep FAIL:
2024-09-05T07:21:03.0592007Z --- FAIL: TestParDoSideInput (12.46s)
2024-09-05T07:21:03.3917563Z --- FAIL: TestWordCount (89.06s)
2024-09-05T07:21:03.4154220Z --- FAIL: TestXLang_Prefix (21.94s)
2024-09-05T07:21:07.9256524Z --- FAIL: TestLPErrorPipeline (12.14s)
2024-09-05T07:21:08.0199833Z --- FAIL: TestMultiEmitParDo (12.24s)

Last successful run was around August 24: https://github.com/apache/beam/actions/workflows/beam_PreCommit_GoPortable.yml?query=is%3Asuccess+branch%3Amaster

@tvalentyn (Contributor)

Will try to downgrade grpcio; it looks like it had a release around Aug 22:

https://pypi.org/project/grpcio/#history

@lostluck (Contributor)

Will try to downgrade grpcio; it looks like it had a release around Aug 22:

https://pypi.org/project/grpcio/#history

Agreed. The failures seem largely gRPC-related.

@tvalentyn (Contributor)

tvalentyn commented Sep 12, 2024

Successful run after placing an upper bound on grpcio (ddff883227ffa064ecaee1f3d308675da103812f): https://github.com/apache/beam/actions/runs/10839063060/job/30078643883?pr=32446
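
For reference, constraining grpcio looks roughly like the sketch below in a setup.py-style dependency list. The package name and version bounds are illustrative assumptions, not the exact change in ddff883.

```python
# Illustrative sketch only: pin grpcio below the suspect release.
# The exact bound used in the actual fix may differ.
from setuptools import setup

setup(
    name="example-package",      # hypothetical package, not apache-beam's setup.py
    version="0.1.0",
    install_requires=[
        # The upper bound keeps pip from resolving the grpcio release that
        # correlates with the flaky GoPortable runs.
        "grpcio>=1.33.1,<1.66.0",
    ],
)
```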
