Unassigning myself but relaying my research on this ticket.
Situation
This workflow's test has failed roughly every 2 to 3 days over the past two weeks.
Background
This workflow is scheduled to run twice daily. Inspection of the latest failures shows a timeout (Failed: Timeout >1800.0s) even though the actual Dataflow Job for each failing execution succeeded. The stack traces differ across the past two weeks' failures. In each build scan's timeline we see that :sdks:python:test-suites:dataflow:py39:runPerformanceTest takes approximately 30m, cutting off at the configured timeout.
That timeout is set on the runPerformanceTest Gradle task via pytest-timeout (https://github.com/pytest-dev/pytest-timeout). The Dataflow Jobs for these failed tests take approximately 10 to 13m. Successful runs do not print any information about the Dataflow Job, so there is nothing to compare against.
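To illustrate the mechanism (the exact flag or ini key the Gradle task passes through is an assumption here, and the test path is illustrative), pytest-timeout imposes a whole-test limit that counts everything the test function does, including post-job cleanup and metrics publishing, not just the Dataflow Job itself:

```shell
# Command-line form: fail any test that runs longer than 1800 s,
# matching the "Failed: Timeout >1800.0s" message in the failing runs.
pytest --timeout=1800 apache_beam/examples/wordcount_it_test.py

# Equivalent pytest.ini form:
#   [pytest]
#   timeout = 1800
```

Because the limit covers the entire test body, the ~15m of post-job work described below eats into the same 1800 s budget as the 10 to 13m Dataflow Job.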
There are additional tasks performed by the _run_wordcount_it method, such as cleanup and publishing metrics to BigQuery. Further analysis shows that the cleanup and metrics publishing only require artifacts and metadata generated during the test, such as the Job ID, Google Cloud Storage files, etc. Notably, there is a read from InfluxDB followed by a write to BigQuery.
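Since these after-test steps only depend on the Job ID and storage paths, they could in principle be decoupled from the timed test body. A minimal sketch of that idea (all names here are hypothetical, not Beam's actual helpers; the cleanup and publish callables stand in for GCS deletion and the BigQuery write):

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TestArtifacts:
    """Hypothetical record of everything the after-test steps need."""
    job_id: str
    gcs_paths: List[str] = field(default_factory=list)


def run_after_test_steps(artifacts: TestArtifacts,
                         cleanup: Callable[[str], None],
                         publish: Callable[[str], None]) -> None:
    """Run cleanup and metrics publishing outside the timed test body.

    Because these steps need only the recorded metadata, they can be
    deferred past the pytest-timeout window instead of counting against it.
    """
    for path in artifacts.gcs_paths:
        cleanup(path)          # e.g. delete staging/temp files
    publish(artifacts.job_id)  # e.g. write run metrics to BigQuery


# Example with stub callables in place of real GCS/BigQuery clients:
deleted, published = [], []
run_after_test_steps(
    TestArtifacts("job-123", ["gs://bucket/tmp/a", "gs://bucket/tmp/b"]),
    cleanup=deleted.append,
    publish=published.append,
)
```

The test itself would only populate a TestArtifacts record; anything that consumes it can run on a separate schedule.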
Assessment
We can rule out failing Dataflow Jobs as the root cause of these failures. Moreover, there is roughly 15m of extra work outside the Dataflow Job execution being performed within the test code, which points to unnecessary coupling of after-test functions with the test run itself.
Recommendations
Remove the after-test cleanup and consider a Google Cloud Storage wildcard approach to schedule deletion of test artifacts outside test execution.
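One way to schedule that deletion outside the test is a bucket lifecycle rule; a hedged sketch (bucket name and prefix are illustrative, not the suite's actual layout):

```shell
# Lifecycle config: delete objects under the test-artifact prefix
# one day after creation, so the test never has to clean up itself.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 1, "matchesPrefix": ["temp/wordcount-it/"]}
    }
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://my-test-bucket
```

This moves artifact cleanup from per-test code to a one-time bucket configuration.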
Remove the InfluxDB read and the BigQuery write from the test. Perhaps use a scheduled batch or streaming pipeline to collect these results into BigQuery instead.
The PerformanceTests WordCountIT PythonVersions is failing over 50% of the time.
Please visit https://github.com/apache/beam/actions/workflows/beam_PerformanceTests_WordCountIT_PythonVersions.yml?query=is%3Afailure+branch%3Amaster to see all failed workflow runs.
See also Grafana statistics: http://metrics.beam.apache.org/d/CTYdoxP4z/ga-post-commits-status?orgId=1&viewPanel=9&var-Workflow=PerformanceTests%20WordCountIT%20PythonVersions