
The PostCommit Python Examples Flink job is flaky #32794

Closed
github-actions bot opened this issue Oct 16, 2024 · 7 comments · Fixed by #33135 or #33138

Comments

@github-actions
Contributor

The PostCommit Python Examples Flink job is failing over 50% of the time.
Please visit https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python_Examples_Flink.yml?query=is%3Afailure+branch%3Amaster to see all failed workflow runs.
See also Grafana statistics: http://metrics.beam.apache.org/d/CTYdoxP4z/ga-post-commits-status?orgId=1&viewPanel=8&var-Workflow=PostCommit%20Python%20Examples%20Flink

@liferoad
Collaborator

INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 Caused by: java.io.IOException: Insufficient number of network buffers: required 16, but only 8 available. The total number of network buffers is currently set to 2048 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.memory.network.fraction', 'taskmanager.memory.network.min', and 'taskmanager.memory.network.max'.
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalCreateBufferPool(NetworkBufferPool.java:495)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:468)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.io.network.partition.ResultPartitionFactory.lambda$createBufferPoolFactory$1(ResultPartitionFactory.java:379)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.io.network.partition.ResultPartition.setup(ResultPartition.java:158)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:969)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:658)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at java.base/java.lang.Thread.run(Thread.java:829)
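The stack trace itself names the relevant Flink settings. For reference, a minimal sketch of raising the network memory pool in flink-conf.yaml could look like the following; the values below are illustrative assumptions, not what this CI job actually runs with:

```yaml
# flink-conf.yaml -- illustrative values (assumption), not the CI configuration.
# Grow the slice of TaskManager memory reserved for network buffers so each
# result partition can allocate the buffers it asks for.
taskmanager.memory.network.fraction: 0.2   # default is 0.1
taskmanager.memory.network.min: 128mb
taskmanager.memory.network.max: 1gb
```

An alternative would be to lower the pipeline's parallelism, since the number of network buffers required scales with the number of channels per result partition.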

@github-actions
Contributor Author

Reopening since the workflow is still flaky

@liferoad
Collaborator

No useful logs from the failed workflows.

@liferoad
Collaborator

Looks like a memory issue:

The node was low on resource: memory. Threshold quantity: 100Mi, available: 74432Ki. Container runner was using 58152720Ki, request is 3Gi, has larger consumption of memory. Container docker was using 43668Ki, request is 0, has larger consumption of memory. 

@liferoad
Collaborator

This test keeps failing now:

2024-11-18T14:26:34.5015302Z apache_beam/examples/cookbook/bigquery_tornadoes_it_test.py::BigqueryTornadoesIT::test_bigquery_tornadoes_it

@liferoad
Collaborator

Even with highmem:

The node was low on resource: memory. Threshold quantity: 100Mi, available: 3568Ki. Container docker was using 42360Ki, request is 0, has larger consumption of memory. Container runner was using 59928716Ki, request is 5Gi, has larger consumption of memory. 

@liferoad
Collaborator

https://github.com/apache/beam/actions/runs/11915340857/job/33205368335 looks good now after switching to the higher-memory machines.
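For the record, that switch amounts to pointing the workflow at a larger self-hosted runner pool. A rough sketch of what the change could look like in the workflow file; the runner label names here are assumptions, not necessarily what #33135 / #33138 actually use:

```yaml
# .github/workflows/beam_PostCommit_Python_Examples_Flink.yml (sketch only)
# "highmem" is a hypothetical placeholder label for the higher-memory
# self-hosted runner pool mentioned above.
jobs:
  beam_PostCommit_Python_Examples_Flink:
    runs-on: [self-hosted, ubuntu-20.04, highmem]
```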
