
The PerformanceTests WordCountIT PythonVersions job is flaky #32144

Open

github-actions bot opened this issue Aug 10, 2024 · 3 comments

@github-actions bot commented Aug 10, 2024

The PerformanceTests WordCountIT PythonVersions job is failing over 50% of the time.
Please visit https://github.com/apache/beam/actions/workflows/beam_PerformanceTests_WordCountIT_PythonVersions.yml?query=is%3Afailure+branch%3Amaster to see all failed workflow runs.
See also Grafana statistics: http://metrics.beam.apache.org/d/CTYdoxP4z/ga-post-commits-status?orgId=1&viewPanel=9&var-Workflow=PerformanceTests%20WordCountIT%20PythonVersions

@damccorm commented

Looks like this is back to green

github-actions bot added this to the 2.59.0 Release milestone Aug 14, 2024
github-actions bot reopened this Nov 7, 2024

github-actions bot commented Nov 7, 2024

Reopening since the workflow is still flaky

damondouglas self-assigned this Dec 5, 2024

damondouglas commented Dec 6, 2024

Unassigning myself but relaying my research on this ticket.

Situation

This workflow's test has failed roughly every 2 to 3 days over the past two weeks.

Background

This workflow is scheduled to run twice daily. Inspection of the latest failures shows a timeout (Failed: Timeout >1800.0s) even though the actual Dataflow Job for that execution succeeded. The stack traces differ across the past two weeks' failures. In each build scan's timeline we see that :sdks:python:test-suites:dataflow:py39:runPerformanceTest takes approximately 30m, cutting off at the configured timeout.

This timeout is set on the runPerformanceTest Gradle task via pytest-timeout (https://github.com/pytest-dev/pytest-timeout). The Dataflow Jobs for these failed tests take approximately 10 to 13m. Successful tests do not print any information about the Dataflow Job to compare against.
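
For reference, a minimal sketch of how a pytest-timeout limit like the 1800s one above is typically applied; the exact flag or marker used by the Beam Gradle task is an assumption here, not pulled from the build scripts:

```python
# Sketch only: pytest-timeout can enforce a per-test wall-clock limit either
# via the command line (pytest --timeout=1800) or via a marker on the test.
import time

import pytest


@pytest.mark.timeout(1800)  # fail the test if it runs longer than 1800 seconds
def test_wordcount_performance():
    # Placeholder for the real pipeline run; in the actual test the Dataflow
    # job plus cleanup/metrics publishing must all finish inside this limit.
    time.sleep(1)
```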

There are additional tasks performed by the _run_wordcount_it method, such as cleanup and publishing metrics to BigQuery. Further analysis shows that the cleanup and metrics publishing only require information about artifacts and metadata generated during the test, such as the Job ID, Google Cloud Storage files, etc. Notably, there is a read from InfluxDB followed by a write to BigQuery.
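
To make the coupling concrete, here is a loose, hypothetical sketch of the shape of such a test method; all helper names below are made up for illustration and are not the actual Beam implementation:

```python
# Hypothetical illustration of after-test work coupled to the test run.
def run_pipeline():
    # Stands in for the ~10-13m Dataflow Job itself.
    return {"job_id": "job-123", "output": "gs://bucket/output*"}


def cleanup_gcs_artifacts(result):
    print("deleting", result["output"])       # after-test work inside the test


def read_influxdb_baseline():
    return {"run_time_s": 750}                 # extra I/O unrelated to pass/fail


def publish_metrics_to_bigquery(metrics):
    print("publishing", metrics)               # more after-test work


def test_wordcount_it():
    result = run_pipeline()
    # Everything below still counts against the same pytest timeout,
    # even though the job itself has already succeeded.
    cleanup_gcs_artifacts(result)
    publish_metrics_to_bigquery(read_influxdb_baseline())
```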

Assessment

We can rule out failing Dataflow Jobs as the root cause of these failures. Moreover, there appears to be ~15m of extra work outside the Dataflow Job execution that is performed within the test code. There seems to be a lot of unnecessary coupling of after-test functions with the test run itself.

Recommendations

  • Remove the after-test cleanup and consider using a Google Cloud Storage wildcard approach to schedule deletion of test artifacts outside test execution (see the sketch after this list).
  • Remove the InfluxDB read and BigQuery write from the test. Perhaps use a scheduled batch or streaming pipeline to collect these results into BigQuery.
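
A minimal sketch of the first recommendation, assuming a separately scheduled job (cron, Cloud Scheduler, etc.) deletes stale artifacts by prefix; the bucket name, prefix, and retention window below are hypothetical:

```python
# Hypothetical scheduled cleanup, decoupled from the test run.
# Assumes google-cloud-storage is installed and ambient GCP credentials.
from datetime import datetime, timedelta, timezone

from google.cloud import storage

BUCKET = "temp-storage-for-perf-tests"   # assumed bucket name
PREFIX = "wordcount-it/"                 # assumed artifact prefix ("wildcard")
MAX_AGE = timedelta(days=1)              # assumed retention window


def delete_stale_artifacts():
    client = storage.Client()
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    for blob in client.list_blobs(BUCKET, prefix=PREFIX):
        if blob.time_created < cutoff:
            blob.delete()
            print(f"deleted gs://{BUCKET}/{blob.name}")


if __name__ == "__main__":
    delete_stale_artifacts()
```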

damondouglas removed their assignment Dec 6, 2024